BE crash when inserting data from Parquet files into Iceberg

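【Details】This afternoon a BE crashed while running a load task. Re-running the task afterwards succeeded, so I am not sure where the problem lies.
Judging from the stack trace, the failure is in the memory management code.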


【Business impact】
【Shared-data or shared-nothing】Shared-nothing
【StarRocks version】3.3.12
【Contact】Community group 22, hpp
【Attachments】
be_crash.out (127.1 KB)

Judging from the log, this looks like an OOM. A core dump was generated.

A similar restart happened again today:
branch-3.3.12 RELEASE (build 9a77cc1)
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
tracker:process consumption: 40671655487
tracker:jemalloc_metadata consumption: 999685648
tracker:query_pool consumption: 23536826719
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 4649462958
tracker:metadata consumption: 7428776131
tracker:tablet_metadata consumption: 670332385
tracker:rowset_metadata consumption: 224298635
tracker:segment_metadata consumption: 927187131
tracker:column_metadata consumption: 5606957980
tracker:tablet_schema consumption: 3913233
tracker:segment_zonemap consumption: 880415201
tracker:short_key_index consumption: 2598174
tracker:column_zonemap_index consumption: 1863928372
tracker:ordinal_index consumption: 1231867128
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 0
tracker:jit_cache consumption: 9152
tracker:update consumption: 2193126
tracker:chunk_allocator consumption: 0
tracker:passthrough consumption: 0
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 50218973
tracker:replication consumption: 0
F20250728 11:13:35.531316 140034926675520 tablet_updates.cpp:930] submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:287757387 #version:14 [199 208@12 209] pending: rowsets:3
11 [seg:1 row:23 del:0 bytes:7944 row_size:0 compaction_score:268427512 compaction_level:-1 partial_update_by_column:false]
12 [seg:0 row:0 del:0 bytes:0 row_size:0 compaction_score:268435456 compaction_level:-1 partial_update_by_column:false]
13 [seg:0 row:0 del:0 bytes:0 row_size:0 compaction_score:268435456 compaction_level:-1 partial_update_by_column:false]
F20250728 11:13:41.998206 140139213633088 tablet_updates.cpp:930] submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:287757391 #version:14 [199.1 208@12 208.1] pending: rowsets:1
279 [seg:1 row:19 del:0 bytes:7662 row_size:0 compaction_score:268427794 compaction_level:-1 partial_update_by_column:false]
*** Aborted at 1753672379 (unix time) try "date -d @1753672379" if you are using GNU date ***
PC: @ 0x4fce01b starrocks::CurrentThread::MemCacheManager::commit(bool)
*** SIGSEGV (@0x0) received by PID 25 (TID 0x7f751216b640) from PID 0; stack trace: ***
@ 0x7f76365b9ee8 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x99ee7)
@ 0x9ab8a69 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f7636562520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
@ 0x4fce01b starrocks::CurrentThread::MemCacheManager::commit(bool)
@ 0x7430d22 std::_Function_handler<void (), starrocks::io::AsyncFlushOutputStream::close()::{lambda()#1}>::_M_invoke(std::_Any_data const&)
@ 0x4f9e1ad starrocks::PriorityThreadPool::work_thread(int)
@ 0x9a6b59b thread_proxy
@ 0x7f76365b4ac3 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x94ac2)
@ 0x7f7636645a04 clone
start time: Mon Jul 28 11:18:00 CST 2025, server uptime: 11:18:00 up 30 days, 9:17, 0 users, load average: 130.92, 153.06, 80.37
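
To sanity-check the OOM hypothesis against the tracker dump above, it may help to convert the byte counts to GiB and sort them. The following is a minimal Python sketch, assuming the `tracker:<name> consumption: <bytes>` line format shown in this log. `parse_trackers` and `to_gib` are hypothetical helper names, not part of StarRocks, and the sample contains only the first few tracker lines from the dump.

```python
import re

# Matches lines such as "tracker:query_pool consumption: 23536826719".
TRACKER_RE = re.compile(r"^tracker:(?P<name>\S+) consumption: (?P<bytes>\d+)$")


def parse_trackers(log_text):
    """Return {tracker_name: bytes} for every tracker line found in log_text."""
    result = {}
    for line in log_text.splitlines():
        match = TRACKER_RE.match(line.strip())
        if match:
            result[match.group("name")] = int(match.group("bytes"))
    return result


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# First few tracker lines copied from the crash log above.
SAMPLE = """\
tracker:process consumption: 40671655487
tracker:jemalloc_metadata consumption: 999685648
tracker:query_pool consumption: 23536826719
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 4649462958
tracker:metadata consumption: 7428776131
"""

if __name__ == "__main__":
    trackers = parse_trackers(SAMPLE)
    # Print the largest consumers first.
    for name, size in sorted(trackers.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {to_gib(size):8.2f} GiB")
```

On this dump, `process` comes out at roughly 37.9 GiB and `query_pool` at roughly 21.9 GiB. Whether that is close to the BE memory limit depends on the node's configuration, which the thread does not state.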

@trueeyu could you help take a look at this? Judging from the code, it may be a memory pointer problem that occurs while writing the Parquet file.

The number of mmap regions may be too high. You can try disabling the JIT feature:

set global jit_level = 0
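
The `Could not create thread: Resource temporarily unavailable` message in the log corresponds to `pthread_create` failing with EAGAIN. On Linux this can happen when a thread-count limit is hit, or when the stack for the new thread cannot be mapped. Below is a rough Python sketch for comparing a process's thread and mapping counts against two kernel limits, assuming a Linux `/proc` filesystem. The helper names are hypothetical. It does not check `ulimit -u` or a cgroup `pids.max`, which may also apply when the BE runs in a container.

```python
import sys
from pathlib import Path


def thread_count(status_text: str) -> int:
    """Extract the 'Threads:' field from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def map_count(maps_text: str) -> int:
    """Count mappings: each non-empty line of /proc/<pid>/maps is one mapping."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def usage_ratio(current: int, limit: int) -> float:
    """Fraction of the limit currently in use."""
    return current / limit if limit > 0 else float("inf")


def report(pid: str = "self") -> None:
    """Print thread and mapping counts for pid next to the kernel limits.

    Note: threads-max is a system-wide limit, so the thread ratio is only a
    rough indicator. Reading another process's maps file may need the same
    user or root.
    """
    proc = Path("/proc") / pid
    threads = thread_count((proc / "status").read_text())
    maps = map_count((proc / "maps").read_text())
    threads_max = int(Path("/proc/sys/kernel/threads-max").read_text())
    max_map_count = int(Path("/proc/sys/vm/max_map_count").read_text())
    print(f"threads:  {threads} / {threads_max} ({usage_ratio(threads, threads_max):.1%})")
    print(f"mappings: {maps} / {max_map_count} ({usage_ratio(maps, max_map_count):.1%})")


if __name__ == "__main__":
    # Pass the BE PID as the first argument; defaults to this script's own process.
    arg = sys.argv[1] if len(sys.argv) > 1 else ""
    if Path("/proc/self/status").exists():
        report(arg if arg.isdigit() else "self")
```

If the mapping count sits close to `vm.max_map_count` shortly before a crash, that would support the mmap explanation. If both ratios stay low, the cause is more likely elsewhere.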

I disabled it and restarted yesterday, but the restarts still occurred today.

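Over the last few days the sudden restarts have become frequent, yet the logs show the cluster's memory load as fairly normal. Our current workflow writes to Iceberg through StarRocks. The query_id involved in the crash belongs to an INSERT INTO Iceberg from a Parquet file, and according to the audit log that query finished normally.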




I found that the failures all involve Parquet files with fewer than 1024 rows. Could it be related to the row count?

Is it easy to reproduce? Could you give me a way to reproduce it? I have a rough idea of where the cause is.

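After looking into it, this seems to be a task management problem in the sink_io asynchronous thread pool. Two days ago I ran a local test with Parquet files of only a few rows at a concurrency of 1000, but could not reproduce it.
For now, as a temporary measure, I have increased the sink_io asynchronous thread pool from 8 to 24 threads and will keep observing how it behaves.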

Is there a plan to fix this issue?