All BE nodes crashed - not a known issue

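To help us locate your problem faster, please provide the following information. Thank you.
[Details] All BE nodes crashed.
[Background] StarRocks is running almost entirely on default configuration. The crashes may be caused by a SQL query, but the specific SQL has not been identified.
[Business impact]
[Shared-data (storage-compute separation)] No
[StarRocks version] e.g.: 3.1.9
[Cluster size] e.g.: 3 FE (1 leader + 2 followers) + 6 BE (FE and BE co-located)
[Machine info] CPU vCores/memory/NIC, e.g.: 32C/256G/10 GbE
[Contact] StarRocks community group 17, 听枫丶
[Attachments]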

  • fe.log / be.INFO / relevant screenshots
  • Slow query:
    • Profile information
    • Parallelism: parallel_fragment_exec_instance_num 1
    • Whether pipeline is enabled: show variables like '%pipeline%';
      |enable_pipeline_engine|true|
      |enable_pipeline_query_statistic|true|
      |max_pipeline_dop|64|
      |pipeline_dop|0|
      |pipeline_profile_level|1|
      |pipeline_sink_dop|0|
  • BE node CPU and memory usage screenshots: memory and CPU are normal, no alerts. be.out.txt (855.9 KB)

After checking be.out on all nodes, the errors are summarized as follows:
*** Aborted at 1716369559 (unix time) try "date -d @1716369559" if you are using GNU date ***
PC: @ 0x7fa0b9fb5aa0 __memset_sse2
*** SIGSEGV (@0x7f9f42bf1000) received by PID 71755 (TID 0x7fa017016700) from PID 1119817728; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fa0bac2b630 (unknown)
@ 0x7fa0b9fb5aa0 __memset_sse2
@ 0x32b5f5c _ZN9starrocks13DecimalV3Cast9to_stringInEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNSt9enable_ifIX29is_underlying_type_of_decimalIT_EES9_E4typeEii
@ 0x32b60ab starrocks::DecimalV3Column<>::put_mysql_row_buffer()
@ 0x5652f0c starrocks::MysqlResultWriter::process_chunk()
@ 0x35c7185 starrocks::pipeline::ResultSinkOperator::push_chunk()
@ 0x3668358 starrocks::pipeline::PipelineDriver::process()
@ 0x365891e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2cf8f5a starrocks::ThreadPool::dispatch_thread()
@ 0x2cf39ea starrocks::thread::supervise_thread()
@ 0x7fa0bac23ea5 start_thread
@ 0x7fa0ba024b0d __clone
@ 0x0 (unknown)
start time: Wed May 22 17:37:07 CST 2024

*** Aborted at 1716367609 (unix time) try "date -d @1716367609" if you are using GNU date ***
PC: @ 0x7f6a4e862296 __memcmp_sse4_1
*** SIGSEGV (@0x0) received by PID 472418 (TID 0x7f69a840f700) from PID 0; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f6a4ff2e1a2 os::Linux::chained_handler()
@ 0x7f6a4ff34826 JVM_handle_linux_signal
@ 0x7f6a4ff2ae13 signalHandler()
@ 0x7f6a4f3fc630 (unknown)
@ 0x7f6a4e862296 __memcmp_sse4_1
@ 0x2ad6327 std::_Rb_tree<>::_M_get_insert_unique_pos()
@ 0x2cc0b3c std::_Rb_tree<>::_M_insert_unique<>()
@ 0x2cb8345 starrocks::RuntimeProfile::add_counter_unlock()
@ 0x2cb84c5 starrocks::RuntimeProfile::add_child_counter()
@ 0x35b1920 starrocks::pipeline::Operator::close()
@ 0x3666947 starrocks::pipeline::PipelineDriver::_mark_operator_closed()
@ 0x3666ca1 starrocks::pipeline::PipelineDriver::_close_operators()
@ 0x366708a starrocks::pipeline::PipelineDriver::finalize()
@ 0x3658fac starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2cf8f5a starrocks::ThreadPool::dispatch_thread()
@ 0x2cf39ea starrocks::thread::supervise_thread()
@ 0x7f6a4f3f4ea5 start_thread
@ 0x7f6a4e7f5b0d __clone
@ 0x0 (unknown)
start time: Wed May 22 17:04:07 CST 2024

*** Aborted at 1716372825 (unix time) try "date -d @1716372825" if you are using GNU date ***
PC: @ 0x6897470 (unknown)
*** SIGSEGV (@0x0) received by PID 11286 (TID 0x7f5d261f0700) from PID 0; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f5dd4182630 (unknown)
@ 0x6897470 (unknown)
@ 0x2a51f6f starrocks::Status::to_protobuf()
@ 0x2bc46e1 starrocks::GetResultBatchCtx::on_data()
@ 0x2bc56cc starrocks::BufferControlBlock::try_add_batch()
@ 0x5655c0a starrocks::StatisticResultWriter::try_add_batch()
@ 0x35c76b4 starrocks::pipeline::ResultSinkOperator::push_chunk()
@ 0x3668358 starrocks::pipeline::PipelineDriver::process()
@ 0x365891e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2cf8f5a starrocks::ThreadPool::dispatch_thread()
@ 0x2cf39ea starrocks::thread::supervise_thread()
@ 0x7f5dd417aea5 start_thread
@ 0x7f5dd357bb0d __clone
@ 0x0 (unknown)
start time: Wed May 22 18:18:46 CST 2024

*** Aborted at 1716369602 (unix time) try "date -d @1716369602" if you are using GNU date ***
PC: @ 0x2c7e7f2 starrocks::compression::getLZ4F_DCtx()
*** SIGSEGV (@0x0) received by PID 614571 (TID 0x7f4d0bfcd700) from PID 0; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f4e8d452630 (unknown)
@ 0x2c7e7f2 starrocks::compression::getLZ4F_DCtx()
@ 0x2c79339 starrocks::Lz4fBlockCompression::_decompress()
@ 0x2c79b32 starrocks::Lz4fBlockCompression::decompress()
@ 0x5407c6b starrocks::PageIO::read_and_decompress_page()
@ 0x53ae923 starrocks::ColumnReader::read_page()
@ 0x53f63d1 starrocks::ScalarColumnIterator::_read_data_page()
@ 0x53f7c80 starrocks::ScalarColumnIterator::_load_next_page()
@ 0x53f8125 starrocks::ScalarColumnIterator::next_batch()
@ 0x4f264c3 starrocks::SegmentIterator::_read()
@ 0x4f1a594 starrocks::SegmentIterator::_do_get_next()
@ 0x4f1cf40 starrocks::SegmentIterator::do_get_next()
@ 0x4f98300 starrocks::MaskMergeIterator::fill()
@ 0x4fa11c4 starrocks::MaskMergeIterator::do_get_next()
@ 0x50172f4 starrocks::RowsetMergerImpl<>::_do_merge_vertically()
@ 0x5018c84 starrocks::RowsetMergerImpl<>::do_merge()
@ 0x5002a2f starrocks::compaction_merge_rowsets()
@ 0x4ec0b63 starrocks::TabletUpdates::_do_compaction()
@ 0x4ec22bb starrocks::TabletUpdates::compaction()
@ 0x4e089bb starrocks::StorageEngine::_perform_update_compaction()
@ 0x4d9f374 starrocks::StorageEngine::_update_compaction_thread_callback()
@ 0x8939360 execute_native_thread_routine
@ 0x7f4e8d44aea5 start_thread
@ 0x7f4e8c84bb0d __clone
@ 0x0 (unknown)
start time: Wed May 22 17:38:08 CST 2024

*** Aborted at 1716367631 (unix time) try "date -d @1716367631" if you are using GNU date ***
PC: @ 0x7f67a13e2296 __memcmp_sse4_1
*** SIGSEGV (@0x0) received by PID 192620 (TID 0x7f6701295700) from PID 0; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f67a2aae1a2 os::Linux::chained_handler()
@ 0x7f67a2ab4826 JVM_handle_linux_signal
@ 0x7f67a2aaae13 signalHandler()
@ 0x7f67a1f7c630 (unknown)
@ 0x7f67a13e2296 __memcmp_sse4_1
@ 0x2ad6327 std::_Rb_tree<>::_M_get_insert_unique_pos()
@ 0x2cc0b3c std::_Rb_tree<>::_M_insert_unique<>()
@ 0x2cb8345 starrocks::RuntimeProfile::add_counter_unlock()
@ 0x2cb84c5 starrocks::RuntimeProfile::add_child_counter()
@ 0x35b1920 starrocks::pipeline::Operator::close()
@ 0x3666947 starrocks::pipeline::PipelineDriver::_mark_operator_closed()
@ 0x3666ca1 starrocks::pipeline::PipelineDriver::_close_operators()
@ 0x366708a starrocks::pipeline::PipelineDriver::finalize()
@ 0x3658fac starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2cf8f5a starrocks::ThreadPool::dispatch_thread()
@ 0x2cf39ea starrocks::thread::supervise_thread()
@ 0x7f67a1f74ea5 start_thread
@ 0x7f67a1375b0d __clone
@ 0x0 (unknown)
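
When every BE writes its own be.out, a small script makes it easier to see how many distinct crash signatures there are and when each one fired. The following is only a minimal sketch, assuming be.out keeps the glog crash format shown above and that times should be displayed in CST (UTC+8). The helper names (parse_crashes, first_starrocks_frame, summarize) are hypothetical and are not part of StarRocks.

```python
import re
import sys
from collections import Counter
from datetime import datetime, timedelta, timezone

# glog crash header, e.g. "*** Aborted at 1716369559 (unix time) try ..."
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
# Faulting instruction, e.g. "PC: @ 0x7fa0b9fb5aa0 __memset_sse2"
PC_RE = re.compile(r"^\s*PC: @\s+0x[0-9a-fA-F]+\s+(.*)$")
# Stack frame, e.g. "@ 0x3668358 starrocks::pipeline::PipelineDriver::process()"
FRAME_RE = re.compile(r"^\s*@\s+0x[0-9a-fA-F]+\s+(.*)$")

# Assumption: the cluster logs in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def parse_crashes(text):
    """Split be.out text into crash records: time, PC symbol, stack frames."""
    crashes = []
    current = None
    for line in text.splitlines():
        m = ABORT_RE.search(line)
        if m:
            current = {
                "time": datetime.fromtimestamp(int(m.group(1)), CST),
                "pc": None,
                "frames": [],
            }
            crashes.append(current)
            continue
        if current is None:
            continue  # ordinary output before the first crash header
        m = PC_RE.match(line)
        if m:
            current["pc"] = m.group(1).strip()
            continue
        m = FRAME_RE.match(line)
        if m:
            current["frames"].append(m.group(1).strip())
    return crashes


def first_starrocks_frame(crash):
    """Return the topmost frame that mentions starrocks, else the PC symbol."""
    for frame in crash["frames"]:
        if "starrocks" in frame:
            return frame
    return crash["pc"]


def summarize(crashes):
    """Count crashes per signature (topmost starrocks frame)."""
    return Counter(first_starrocks_frame(c) for c in crashes)


if __name__ == "__main__":
    # Usage: python crash_summary.py be.out [more be.out files ...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            found = parse_crashes(f.read())
        for signature, count in summarize(found).most_common():
            print(f"{path}\t{count}\t{signature}")
```

Grouping by the topmost starrocks frame is a deliberate simplification: it separates the result-sink, profile-counter, and compaction crashes above, but two different bugs that fault in the same function would be merged into one line.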

Does the crash fail to reproduce even when the query for the query_id in the crash report is run multiple times?

It does reproduce; see the attachments for details. Thanks. 复现SQL_1.txt (34.8 KB) be.out.txt (855.9 KB) dump_file (22.3 KB)

OK, added.

PC: @ 0x2c7e7f2 starrocks::compression::getLZ4F_DCtx()
*** SIGSEGV (@0x0) received by PID 614571 (TID 0x7f4d0bfcd700) from PID 0; stack trace: ***
@ 0x64fe742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f4e8d452630 (unknown)
@ 0x2c7e7f2 starrocks::compression::getLZ4F_DCtx()
@ 0x2c79339 starrocks::Lz4fBlockCompression::_decompress()
@ 0x2c79b32 starrocks::Lz4fBlockCompression::decompress()
@ 0x5407c6b starrocks::PageIO::read_and_decompress_page()
@ 0x53ae923 starrocks::ColumnReader::read_page()
@ 0x53f63d1 starrocks::ScalarColumnIterator::_read_data_page()
@ 0x53f7c80 starrocks::ScalarColumnIterator::_load_next_page()
@ 0x53f8125 starrocks::ScalarColumnIterator::next_batch()
@ 0x4f264c3 starrocks::SegmentIterator::_read()
@ 0x4f1a594 starrocks::SegmentIterator::_do_get_next()
@ 0x4f1cf40 starrocks::SegmentIterator::do_get_next()
@ 0x4f98300 starrocks::MaskMergeIterator::fill()
@ 0x4fa11c4 starrocks::MaskMergeIterator::do_get_next()
@ 0x50172f4 starrocks::RowsetMergerImpl<>::_do_merge_vertically()
@ 0x5018c84 starrocks::RowsetMergerImpl<>::do_merge()
@ 0x5002a2f starrocks::compaction_merge_rowsets()
@ 0x4ec0b63 starrocks::TabletUpdates::_do_compaction()
@ 0x4ec22bb starrocks::TabletUpdates::compaction()
@ 0x4e089bb starrocks::StorageEngine::_perform_update_compaction()
@ 0x4d9f374 starrocks::StorageEngine::_update_compaction_thread_callback()
@ 0x8939360 execute_native_thread_routine
@ 0x7f4e8d44aea5 start_thread
@ 0x7f4e8c84bb0d __clone
@ 0x0 (unknown)
We are hitting this same error. Was it also fixed by this PR?

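No. Which version are you on?

We are on v3.1.5. Is there a PR that fixes it? Could you please share the PR link so we can take a look?

Could you first upgrade to the latest 3.1 release and see?

This is a production cluster, so upgrading frequently is not convenient. We would like to know whether there is a PR that we can cherry-pick.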