Please help analyze the be.out log to find the cause of a BE crash

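【Details】Could someone help analyze why the BE crashed? We have not been able to determine the cause so far.
After upgrading the cluster from 2.4.3 to 2.5.3, BE nodes crash intermittently. The be.out log is as follows: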

I0000 00:00:00.000000 28670 vlog_is_on.cc:197] RAW: Set VLOG level for "" to 10
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/STARROCKS/starrocks-2.5.3-customize/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/STARROCKS/starrocks-2.5.3-customize/be/lib/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
tracker:process consumption: 76996992314
tracker:query_pool consumption: 8817898771
tracker:load consumption: 3120
tracker:metadata consumption: 2343787541
tracker:tablet_metadata consumption: 590234552
tracker:rowset_metadata consumption: 469924768
tracker:segment_metadata consumption: 561709567
tracker:column_metadata consumption: 721918654
tracker:tablet_schema consumption: 5399280
tracker:segment_zonemap consumption: 517604368
tracker:short_key_index consumption: 25434187
tracker:column_zonemap_index consumption: 122473422
tracker:ordinal_index consumption: 376718784
tracker:bitmap_index consumption: 6914800
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 23116872
tracker:schema_change consumption: 0
tracker:column_pool consumption: 2603442132
tracker:page_cache consumption: 54506488368
tracker:update consumption: 2780625349
tracker:chunk_allocator consumption: 1595204240
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1680482168 (unix time) try "date -d @1680482168" if you are using GNU date ***
PC: @ 0x47f1bed starrocks::print_id()
*** SIGSEGV (@0x0) received by PID 28670 (TID 0x7ff7ce23a700) from PID 0; stack trace: ***
@ 0x5769222 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7ffaba4b4235 os::Linux::chained_handler()
@ 0x7ffaba4b9031 JVM_handle_linux_signal
@ 0x7ffaba4ac0c8 signalHandler()
@ 0x7ffab9963630 (unknown)
@ 0x47f1bed starrocks::print_id()
@ 0x508523c starrocks::PInternalServiceImplBase<>::transmit_chunk()
@ 0x59b72ad brpc::policy::ProcessRpcRequest()
@ 0x58f9077 brpc::ProcessInputMessage()
@ 0x58f9f4b brpc::InputMessenger::OnNewMessages()
@ 0x58ea4de brpc::Socket::ProcessEvent()
@ 0x58bd22f bthread::TaskGroup::task_runner()
@ 0x5a04b81 bthread_make_fcontext
start time: Mon Apr 3 08:36:35 CST 2023
I0000 00:00:00.000000 25106 vlog_is_on.cc:197] RAW: Set VLOG level for "" to 10
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/STARROCKS/starrocks-2.5.3-customize/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/STARROCKS/starrocks-2.5.3-customize/be/lib/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

【Background】None
【Business impact】ETL jobs fail
【StarRocks version】2.5.3
【Cluster size】3 FE + 5 BE
【Machine info】
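
The "Aborted at" line in the log carries the crash time as Unix epoch seconds, and the log itself suggests decoding it with GNU date. A minimal sketch of that step follows, assuming GNU date and a Bourne-style shell; the helper names abort_epoch and epoch_to_utc are hypothetical. For this log the epoch 1680482168 is 2023-04-03 00:36:08 UTC, which is 08:36:08 if CST here means China Standard Time (UTC+8), about 27 seconds before the "start time" line of the restarted BE.

```shell
# Hypothetical helper: extract the epoch seconds from a glog
# "*** Aborted at <epoch> (unix time) ..." line read on stdin.
abort_epoch() {
    sed -n 's/^\*\*\* Aborted at \([0-9][0-9]*\) (unix time).*/\1/p'
}

# Hypothetical helper: format epoch seconds in UTC (GNU date syntax).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# The line below is copied from the be.out excerpt in this thread.
line='*** Aborted at 1680482168 (unix time) try "date -d @1680482168" if you are using GNU date ***'
epoch="$(printf '%s\n' "$line" | abort_epoch)"
echo "abort epoch: $epoch"
echo "abort time (UTC): $(epoch_to_utc "$epoch")"
```

To run this against a real file, pipe be.out into abort_epoch instead of the sample line; each crash recorded in the file yields one epoch per line.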

Was a core file generated?

How do I get the core file?

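Can this crash stack be reproduced? If the core file size line in the output of ulimit -a shows 0, core dumps are not enabled. Running ulimit -c unlimited enables core files temporarily.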
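
A minimal sketch of the check described above, assuming a Bourne-style shell such as bash; the helper name describe_core_limit is hypothetical. Note that a ulimit change applies only to the current shell and to processes started from it afterwards, so a BE that is already running keeps the limit it was started with.

```shell
# Hypothetical helper: interpret a soft core-size limit as reported by `ulimit -c`.
describe_core_limit() {
    case "$1" in
        0)         echo "core dumps disabled" ;;
        unlimited) echo "core dumps enabled, no size cap" ;;
        *)         echo "core dumps enabled, capped at $1 blocks" ;;
    esac
}

echo "before: $(describe_core_limit "$(ulimit -c)")"

# Raise the soft limit for this shell session only.
# This can fail when the hard limit is lower than the requested value.
if ulimit -c unlimited 2>/dev/null; then
    echo "after:  $(describe_core_limit "$(ulimit -c)")"
else
    echo "could not raise the limit; check the hard limit with: ulimit -H -c"
fi
```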

Output of ulimit -a:
[screenshot of the ulimit -a output]

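Does ulimit -c unlimited require restarting the cluster to take effect? I already ran it yesterday, but after the BE restarted last night there is still no core file in the be directory.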

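Restarting the BE does not produce a core file. Judging from your screenshot, core dumps are currently not enabled. Once they are enabled, a core file may be generated only when the BE actually crashes.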

So once a BE crash occurs and the BE is then restarted, the core file is gone, right?

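A BE crash produces a core file, and restarting does not make it disappear. By default the core file is written to the root of the BE deployment directory.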
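
On Linux, where a core file ends up is governed by the kernel.core_pattern setting, and the limit that matters is the one the running BE process was started with. The following is a rough sketch for checking both, assuming a Linux host with /proc mounted. The helper names and the BE_PID variable are hypothetical, and starrocks_be as the BE process name is an assumption about this deployment.

```shell
# Hypothetical helper: soft "Max core file size" of a running process, read from /proc.
core_limit_of_pid() {
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}

# Hypothetical helper: explain where the kernel writes a core for a given core_pattern.
describe_core_pattern() {
    case "$1" in
        '|'*) echo "piped to a handler program" ;;
        /*)   echo "written to an absolute path" ;;
        *)    echo "written relative to the working directory of the crashing process" ;;
    esac
}

if [ -r /proc/sys/kernel/core_pattern ]; then
    pattern="$(cat /proc/sys/kernel/core_pattern)"
    echo "core_pattern: $pattern ($(describe_core_pattern "$pattern"))"
fi

# BE_PID is an assumption: set it to the BE process ID,
# for example from `pgrep -f starrocks_be`.
if [ -n "${BE_PID:-}" ] && [ -r "/proc/$BE_PID/limits" ]; then
    echo "BE soft core limit: $(core_limit_of_pid "$BE_PID")"
fi
```

If the reported soft limit of the running BE is 0, a crash of that process will not leave a core file, regardless of what ulimit -c shows in a newly opened shell.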


Is that the be folder?

See the thread "Common Crash / BUG / Optimization Lookup" (常见 Crash / BUG / 优化 查询). This issue is fixed in 2.5.4.

What causes this? For us it happens only occasionally, and we do not know which operation triggers it, so we are a bit puzzled.

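StarRocks currently uses a third-party library, brpc, and brpc runs on bthread. When bthread switches threads, the memory control logic gets thrown into disorder, which triggers std::bad_alloc.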

The fix PRs listed there all seem to be in the 2.5.3 branch already. Was the wrong link posted?

We still had this problem on 2.5.3, and it went away after upgrading to 2.5.4.

OK, we will upgrade as well and see. Thanks.