BE crashes on a 3.1.5 shared-nothing (integrated storage and compute) cluster

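To help us locate your problem faster, please provide the following information. Thank you.
[Details] Detailed description of the problem
[Background] What operations were performed?
[Business impact]
[Storage-compute separation?] No
[StarRocks version] e.g. 3.1.5
[Cluster size] e.g. 1 FE + 7 BE (FE and BE co-located)
[Machine info] CPU virtual cores / memory / NIC, e.g. 48C/64G/10 Gigabit
[Contact] So that we can reach you promptly for logs while working on the issue, please add your contact information, e.g. Community Group 3 - 杨荣
[Attachments]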

  • fe.log / be.INFO / relevant screenshots

3.1.5 RELEASE (build 5d8438a)
*** Aborted at 1702696788 (unix time) try "date -d @1702696788" if you are using GNU date ***
PC: @ 0x4e536dc starrocks::DataStreamRecvr::SenderQueue::_build_chunk_meta()
*** SIGSEGV (@0x100646973) received by PID 2030 (TID 0x7fe26af33700) from PID 6580595; stack trace: ***
@ 0x63911c2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fe33eadd44f os::Linux::chained_handler()
@ 0x7fe33eae32b8 JVM_handle_linux_signal
@ 0x7fe33ead4cb8 signalHandler()
@ 0x7fe33dc6e630 (unknown)
@ 0x4e536dc starrocks::DataStreamRecvr::SenderQueue::_build_chunk_meta()
@ 0x4e547ae starrocks::DataStreamRecvr::PipelineSenderQueue::try_to_build_chunk_meta()
@ 0x4e5faeb starrocks::DataStreamRecvr::PipelineSenderQueue::add_chunks<>()
@ 0x4e57012 starrocks::DataStreamRecvr::PipelineSenderQueue::add_chunks()
@ 0x4dc695b starrocks::DataStreamRecvr::add_chunks()
@ 0x4d5fa0f starrocks::DataStreamMgr::transmit_chunk()
@ 0x598c05c starrocks::PInternalServiceImplBase<>::_transmit_chunk()
@ 0x4d8b920 starrocks::PriorityThreadPool::work_thread()
@ 0x6350b87 thread_proxy
@ 0x7fe33dc66ea5 start_thread
@ 0x7fe33d067b0d __clone
@ 0x0 (unknown)

3.1.5 RELEASE (build 5d8438a)
*** Aborted at 1702707180 (unix time) try "date -d @1702707180" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 18375 (TID 0x7f1789d29700) from PID 0; stack trace: ***
@ 0x63911c2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f1859712630 (unknown)
@ 0x0 (unknown)
start time: Sat Dec 16 14:14:33 CST 2023

This stack trace produced a core file.

The first stack trace looks similar to the reorder join issue.
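The `Aborted at` lines give the crash time as a Unix timestamp, which is awkward to compare with the `start time` lines. A minimal sketch of the conversion, assuming that "CST" in these logs means China Standard Time (UTC+8); the helper name `abort_time_cst` is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in the BE log is China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8), name="CST")

def abort_time_cst(unix_seconds: int) -> str:
    """Format a glog 'Aborted at' Unix timestamp as a CST wall-clock time."""
    return datetime.fromtimestamp(unix_seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S %Z")

# Timestamps copied from the two stack traces above.
for ts in (1702696788, 1702707180):
    print(ts, "->", abort_time_cst(ts))
# 1702696788 -> 2023-12-16 11:19:48 CST
# 1702707180 -> 2023-12-16 14:13:00 CST
```

The second crash at 14:13:00 falls about a minute and a half before `start time: Sat Dec 16 14:14:33 CST 2023`, which would be consistent with the BE being restarted right after it went down.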

3.1.5 RELEASE (build 5d8438a)
*** Aborted at 1702714168 (unix time) try "date -d @1702714168" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 8867 (TID 0x7f616f734700) from PID 0; stack trace: ***
@ 0x63911c2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f6245fca44f os::Linux::chained_handler()
@ 0x7f6245fd02b8 JVM_handle_linux_signal
@ 0x7f6245fc1cb8 signalHandler()
@ 0x7f624515b630 (unknown)
@ 0x0 (unknown)
start time: Sat Dec 16 16:09:43 CST 2023

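This stack trace also shows up fairly often. There is no core file for it yet. I have lifted the ulimit setting on one cluster and will check whether a core file is generated the next time the problem is triggered.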
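Whether a core file gets written depends on the core file size limit of the crashing process. A rough sketch for inspecting that limit with Python's standard `resource` module (Unix only); the helper name `describe_core_limit` is hypothetical. It reports the limit of the process that runs it, so it only reflects the BE's limit if it is run from the same shell or service environment that starts the BE:

```python
import resource

def describe_core_limit(soft: int) -> str:
    """Describe a soft RLIMIT_CORE value as returned by resource.getrlimit()."""
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    if soft == 0:
        return "disabled"
    return f"capped at {soft} bytes"

# getrlimit returns (soft, hard); the soft limit is the one that is enforced.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core file size limit:", describe_core_limit(soft))
```

If this prints `disabled`, a crash in a process started from that environment would not leave a core file.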

First, go through the [Troubleshooting] BE Crash guide and confirm that all of the configuration is correct.

This has already been fixed. Please upgrade to 3.1.6.