【BE nodes crashed unexpectedly】

StarRocks 2.5.4, 5-node cluster with 3 FEs and 5 BEs. Each node has 756 GB of memory, and the BE memory limit is set to 80%.

All BE nodes crashed unexpectedly. The error logs are as follows:
Crash log:
*** Aborted at 1699525254 (unix time) try "date -d @1699525254" if you are using GNU date ***
PC: @ 0x32558a0 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0x20) received by PID 113564 (TID 0x7f440bb1e700) from PID 32; stack trace: ***
@ 0x583b0e2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f46010f0c2d os::Linux::chained_handler()
@ 0x7f46010f66e2 JVM_handle_linux_signal
@ 0x7f46010e85a8 signalHandler()
@ 0x7f45ffb67630 (unknown)
@ 0x32558a0 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
@ 0x3064dce starrocks::Aggregator::compute_batch_agg_states_with_selection()
@ 0x2fc18a1 starrocks::pipeline::AggregateStreamingSinkOperator::_push_chunk_by_auto()
@ 0x2fc1b61 starrocks::pipeline::AggregateStreamingSinkOperator::push_chunk()
@ 0x2cb33b6 starrocks::pipeline::PipelineDriver::process()
@ 0x4ed8513 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x48d9e22 starrocks::ThreadPool::dispatch_thread()
@ 0x48d491a starrocks::thread::supervise_thread()
@ 0x7f45ffb5fea5 start_thread
@ 0x7f45ff8888dd __clone
@ 0x0 (unknown)
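
For anyone triaging a similar crash, the header and frames above can be summarized with a short script. This is only a sketch: `summarize_crash` and `tz_hours` are names made up for illustration, and the UTC+8 offset is an assumption inferred from the 18:20 timestamps in the warning log, not something stated in the thread.

```python
import re
from datetime import datetime, timedelta, timezone

# Patterns for the glog failure-signal output pasted above.
HEADER = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
SIGNAL = re.compile(r"\*\*\* (SIG\w+) \(@(0x[0-9a-f]+)\) received by PID (\d+)")
FRAME = re.compile(r"@\s+(0x[0-9a-f]+)\s+(\S.*)")


def summarize_crash(text, tz_hours=8):
    """Extract crash time, signal, fault address, PID and StarRocks frames.

    tz_hours is an assumed local offset (UTC+8 here); adjust for your cluster.
    """
    local = timezone(timedelta(hours=tz_hours))
    summary = {"time": None, "signal": None, "fault_addr": None,
               "pid": None, "frames": []}
    for raw in text.splitlines():
        line = raw.strip()
        if (m := HEADER.search(line)):
            summary["time"] = datetime.fromtimestamp(int(m.group(1)), tz=local)
        elif (m := SIGNAL.search(line)):
            summary["signal"] = m.group(1)
            summary["fault_addr"] = m.group(2)
            summary["pid"] = int(m.group(3))
        elif (m := FRAME.match(line)):
            # Keep only StarRocks frames; skip handler, JVM and libc frames.
            # The "PC:" line does not start with "@", so it is not counted twice.
            if m.group(2).startswith("starrocks::"):
                summary["frames"].append(m.group(2))
    return summary


# A few lines copied from the crash log in this thread.
SAMPLE = """
*** Aborted at 1699525254 (unix time) try "date -d @1699525254" if you are using GNU date ***
PC: @ 0x32558a0 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0x20) received by PID 113564 (TID 0x7f440bb1e700) from PID 32; stack trace: ***
@ 0x583b0e2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f45ffb67630 (unknown)
@ 0x32558a0 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
@ 0x3064dce starrocks::Aggregator::compute_batch_agg_states_with_selection()
@ 0x2fc18a1 starrocks::pipeline::AggregateStreamingSinkOperator::_push_chunk_by_auto()
"""

summary = summarize_crash(SAMPLE)
print(summary["time"].isoformat())  # 2023-11-09T18:20:54+08:00
print(summary["signal"], summary["fault_addr"], summary["pid"])
print(summary["frames"][0])
```

The decoded time lands about 40 seconds after the first warning lines quoted below, which helps line the crash up with the queries that were running at that moment.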

Warning log:
4243514 W1109 18:20:14.148798 52935 agent_task.cpp:292] storage migrate failed. status:Internal error: could not migration because has unfinished txns., signature:19354481
4243515 W1109 18:20:14.876611 115097 exec_state_reporter.cpp:129] Retrying ReportExecStatus: write() send(): Broken pipe
4243516 W1109 18:20:15.140658 115097 exec_state_reporter.cpp:129] Retrying ReportExecStatus: write() send(): Broken pipe
4243517 W1109 18:20:17.470609 117192 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf4c, reason=Cancelled: LimitReach
4243518 W1109 18:20:17.470782 117201 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf45, reason=Cancelled: LimitReach
4243519 W1109 18:20:17.470443 117197 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf2a, reason=Cancelled: LimitReach
4243520 W1109 18:20:17.470993 117203 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf42, reason=Cancelled: LimitReach
4243521 W1109 18:20:17.471222 117192 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf2c, reason=Cancelled: LimitReach
4243522 W1109 18:20:17.471408 117264 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf37, reason=Cancelled: LimitReach
4243523 W1109 18:20:17.471555 117232 fragment_context.cpp:21] [Driver] Canceled, query_id=92df47ce-7ee9-11ee-a995-e8611f3faf25, instance_id=92df47ce-7ee9-11ee-a995-e8611f3faf31, reason=Cancelled: LimitReach
4243524 E1109 18:20:17.525765 117220 sender_queue.cpp:696] Cancelled receiver cannot add_chunk!
4243525 E1109 18:20:17.526042 117263 sender_queue.cpp:696] Cancelled receiver cannot add_chunk!
4243526 W1109 18:20:17.821033 53425 exec_state_reporter.cpp:129] Retrying ReportExecStatus: write() send(): Broken pipe
4243527 W1109 18:20:18.050956 117122 tablet_updates.cpp:1096] wait_for_version slow(3265ms) version:19150.1 tablet:18951027 #version:259 [18904 19150.1@258 19150.1] pending: rowsets:3[id/seg/row/del/byte/compaction]: [1340/1/1519171/277/ 32.03 MB/-838.00 B],[2840/1/1134007/3382/32.74 MB/-261.42 KB],[20304/1/399544/0/13.30 MB/18.70 MB][2840/1/1134007/3382/32.74 MB/-261.42 KB],[20304/1/399544/0/13.30 MB/18.70 MB]
4243528 W1109 18:20:19.745615 114901 mem_hook.cpp:254] large memory alloc: 1153462756 bytes, stack:
4243529 @ 0x48008ab malloc
4243530 @ 0x7cf1885 operator new()
4243531 @ 0x2c85d1e std::vector<>::_M_range_insert<>()
4243532 @ 0x2c89374 starrocks::vectorized::BinaryColumnBase<>::append()
4243533 @ 0x4df99ea starrocks::vectorized::NullableColumn::append()
4243534 @ 0x2d68543 starrocks::vectorized::JoinHashTable::append_chunk()
4243535 @ 0x2d5bcbc starrocks::vectorized::HashJoinNode::open()
4243536 @ 0x2e508fb starrocks::vectorized::ProjectNode::open()
4243537 @ 0x2d5c06d starrocks::vectorized::HashJoinNode::open()
4243538 @ 0x2e508fb starrocks::vectorized::ProjectNode::open()
4243539 @ 0x47e9f64 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
4243540 @ 0x47ec21d starrocks::PlanFragmentExecutor::open()
4243541 @ 0x473e34b starrocks::FragmentExecState::execute()
4243542 @ 0x4744593 starrocks::FragmentMgr::exec_actual()
4243543 @ 0x48d9e22 starrocks::ThreadPool::dispatch_thread()
4243544 @ 0x48d491a starrocks::thread::supervise_thread()
4243545 @ 0x7f45ffb5fea5 start_thread
4243546 @ 0x7f45ff8888dd __clone
4243547 @ (nil) (unknown)
4243548 W1109 18:20:20.859705 114875 mem_hook.cpp:254] large memory alloc: 1165259236 bytes, stack:
4243549 @ 0x48008ab malloc
4243550 @ 0x7cf1885 operator new()
4243551 @ 0x2c85d1e std::vector<>::_M_range_insert<>()
4243552 @ 0x2c89374 starrocks::vectorized::BinaryColumnBase<>::append()
4243553 @ 0x4df99ea starrocks::vectorized::NullableColumn::append()
4243554 @ 0x2d68543 starrocks::vectorized::JoinHashTable::append_chunk()
4243555 @ 0x2d5bcbc starrocks::vectorized::HashJoinNode::open()
4243556 @ 0x2e508fb starrocks::vectorized::ProjectNode::open()
4243557 @ 0x2d5c06d starrocks::vectorized::HashJoinNode::open()
4243558 @ 0x2e508fb starrocks::vectorized::ProjectNode::open()
4243559 @ 0x47e9f64 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
4243560 @ 0x47ec21d starrocks::PlanFragmentExecutor::open()
4243561 @ 0x473e34b starrocks::FragmentExecState::execute()
4243562 @ 0x4744593 starrocks::FragmentMgr::exec_actual()
4243563 @ 0x48d9e22 starrocks::ThreadPool::dispatch_thread()
4243564 @ 0x48d491a starrocks::thread::supervise_thread()
4243565 @ 0x7f45ffb5fea5 start_thread
4243566 @ 0x7f45ff8888dd __clone
4243567 @ (nil) (unknown)
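
To put the `large memory alloc` warnings in perspective, the byte counts can be converted to GiB and attributed to the first frame that is not an allocator or standard-library frame. The sketch below assumes the log format shown in this thread; `large_allocs` and the `SKIP` prefixes are illustrative choices, not part of StarRocks.

```python
import re

ALLOC = re.compile(r"large memory alloc: (\d+) bytes")
FRAME = re.compile(r"@\s+(?:0x[0-9a-f]+|\(nil\))\s+(\S.*)")
# Frames that only perform the allocation; the interesting caller comes after.
SKIP = ("malloc", "operator new", "std::")


def large_allocs(text):
    """Return [(size_in_gib, first_non_allocator_frame), ...] per warning."""
    results = []
    for line in text.splitlines():
        m = ALLOC.search(line)
        if m:
            results.append([int(m.group(1)) / 2**30, None])
            continue
        m = FRAME.search(line)
        if m and results and results[-1][1] is None:
            symbol = m.group(1)
            if not symbol.startswith(SKIP):
                results[-1][1] = symbol
    return [(round(gib, 2), caller) for gib, caller in results]


# Lines copied from the warnings quoted in this thread.
SAMPLE = """
4243528 W1109 18:20:19.745615 114901 mem_hook.cpp:254] large memory alloc: 1153462756 bytes, stack:
4243529 @ 0x48008ab malloc
4243530 @ 0x7cf1885 operator new()
4243531 @ 0x2c85d1e std::vector<>::_M_range_insert<>()
4243532 @ 0x2c89374 starrocks::vectorized::BinaryColumnBase<>::append()
4243533 @ 0x4df99ea starrocks::vectorized::NullableColumn::append()
W0328 17:30:43.039203 32612 mem_hook.cpp:254] large memory alloc: 1195725857 bytes, stack:
@ 0x46fc4db malloc
@ 0x7be9bc5 operator new()
@ 0x7c6206a std::__cxx11::basic_string<>::_M_mutate()
@ 0x2b7f29d apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
"""

for gib, caller in large_allocs(SAMPLE):
    print(f"{gib:.2f} GiB requested via {caller}")
```

Each warning here is a single allocation of roughly 1.1 GiB, which is small next to 80% of 756 GB, so on their own these lines show individual requests rather than total BE memory usage.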

This is most likely a known issue, and fixes exist for similar stack traces. Please upgrade to the latest 2.5.13 release and verify again.

Is it this issue?


It looks like that one was already fixed before 2.5, though.

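Several PRs cover stacks like this one, and they are all merged in the latest release. Upgrading also avoids the impact of other known issues later on.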

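Could you tell us roughly which function triggers this?
Upgrading is not very convenient for us right now, so we would like to avoid the triggering usage as a temporary workaround to stop the BEs from crashing.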

I ran into this problem too, on version 2.5.2. Did you end up solving it by upgrading, or in some other way?

Please post your be.out.

My be.out has almost no output. The main error in the INFO log is this:
W0328 17:30:43.039203 32612 mem_hook.cpp:254] large memory alloc: 1195725857 bytes, stack:
@ 0x46fc4db malloc
@ 0x7be9bc5 operator new()
@ 0x7c6206a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7c62a90 std::__cxx11::basic_string<>::_M_replace_aux()
@ 0x2b7f29d apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
@ 0x2b7f3ac apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
@ 0x4852ba9 apache::thrift::TDispatchProcessor::process()
@ 0x570b018 apache::thrift::server::TConnectedClient::run()
@ 0x5703514 apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
@ 0x5705d1d apache::thrift::concurrency::thread::threadMain()
@ 0x56eb486 std::thread::_State_impl<>::_M_run()
@ 0x7c64900 execute_native_thread_routine
@ 0x7f79a72f3ea5 (/usr/lib64/libpthread-2.17.so;6527c38c (deleted)+0x7ea4)
@ 0x7f79a690e9fd (/usr/lib64/libc-2.17.so;6527c38c (deleted)+0xfe9fc)
@ (nil) (unknown)


This one has been fixed. Please upgrade.

Run dmesg -T and check whether the process was OOM-killed. Also check cat /proc/sys/vm/overcommit_memory
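
The two checks suggested above can be scripted roughly as follows. This is a sketch, not an official tool: `find_oom_lines` and `read_overcommit_mode` are hypothetical helper names, and the keyword pattern is an assumption about typical kernel OOM-killer wording, which varies between kernel versions.

```python
import re
from pathlib import Path

# Meaning of the values in /proc/sys/vm/overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}

# Assumed keywords for OOM-killer messages; wording differs across kernels.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(dmesg_text):
    """Return the lines of saved `dmesg -T` output that look like OOM kills."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the overcommit setting and return (value, description)."""
    value = int(Path(path).read_text().strip())
    return value, OVERCOMMIT_MODES.get(value, "unknown")
```

One possible way to use it: save the kernel log with `dmesg -T > dmesg.txt`, pass `Path("dmesg.txt").read_text()` to `find_oom_lines`, and call `read_overcommit_mode()` on the BE host itself.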

It is indeed an OOM.

It hit OOM after using only 1 GB of memory?

/proc/sys/vm/overcommit_memory

Set this to 1.

I checked, and it is already 1. I am on 2.5.2 and would like to upgrade to 3.1. Would that be OK?

This is not a version problem, it is a problem with your environment. This Java process had used only 1 GB when it was OOM-killed.

Try upgrading to 2.5.20 first and see.

OK, I will upgrade to 2.5.20 first.

The log says 1 GB, but in monitoring I can see memory usage spiking.