Common Crash / Bug / Optimization Lookup

OlapScanNode use-after-free

*** Aborted at 1666617333 (unix time) try "date -d @1666617333" if you are using GNU date ***
PC: @          0x2b5243b starrocks::ExprContext::close()
*** SIGSEGV (@0x60) received by PID 302338 (TID 0x7f58702a0700) from PID 96; stack trace: ***
    @          0x3cf55d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f5921984630 (unknown)
    @          0x2b5243b starrocks::ExprContext::close()
    @          0x2b535c0 starrocks::Expr::close()
    @          0x27467e6 starrocks::vectorized::TabletScanner::close()
    @          0x2746e78 starrocks::vectorized::TabletScanner::~TabletScanner()
    @          0x246bbd7 _ZZN9starrocks10ObjectPool3addINS_10vectorized13TabletScannerEEEPT_S5_ENUlPvE_4_FUNES6_
    @          0x246b3ff starrocks::vectorized::OlapScanNode::~OlapScanNode()
    @          0x246ba22 starrocks::vectorized::OlapScanNode::~OlapScanNode()
    @          0x1ed7787 std::_Sp_counted_ptr<>::_M_dispose()
    @          0x18e7fda std::_Sp_counted_base<>::_M_release()
    @          0x1ed3e42 starrocks::RuntimeState::~RuntimeState()
    @          0x1e65a22 starrocks::FragmentExecState::~FragmentExecState()
    @          0x1e6edab std::_Sp_counted_ptr<>::_M_dispose()
    @          0x18e7fda std::_Sp_counted_base<>::_M_release()
    @          0x1e66db5

Or the stack trace may be incomplete:

*** SIGSEGV (@0x0) received by PID 3104 (TID 0x7f51b752b700) from PID 0; stack trace: ***
    @          0x3ff7972 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f522347c630 (unknown)
    @                0x0 (unknown)
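
Every crash banner on this page carries a Unix timestamp, which helps when matching a crash against FE/BE logs. The following is a minimal sketch that pulls the timestamp out of the `*** Aborted at ...` line and converts it to wall-clock time. The function name `crash_time` and the default offset of UTC+8 are assumptions made for illustration, not part of any StarRocks tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the glog banner, e.g. "*** Aborted at 1666617333 (unix time) ..."
ABORT_RE = re.compile(r"Aborted at (\d+) \(unix time\)")


def crash_time(log_text, utc_offset_hours=8):
    """Return the crash time as an aware datetime, or None if no banner is found.

    utc_offset_hours=8 (UTC+8) is an assumption about the cluster's time zone.
    """
    match = ABORT_RE.search(log_text)
    if match is None:
        return None
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(int(match.group(1)), tz)


banner = ('*** Aborted at 1666617333 (unix time) try "date -d @1666617333" '
          'if you are using GNU date ***')
print(crash_time(banner).isoformat())
```

This is equivalent to running `date -d @1666617333` on a host whose time zone matches the chosen offset.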

array_agg crash

*** Aborted at 1670508783 (unix time) try "date -d @1670508783" if you are using GNU date ***
PC: @          0x28f91b5 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0xa0) received by PID 4440 (TID 0x7f9a6a737700) from PID 160; stack trace: ***
    @          0x3ca37d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f9ad08a3630 (unknown)
    @          0x28f91b5 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
    @          0x26c9cd9 starrocks::Aggregator::compute_batch_agg_states_with_selection()
    @          0x263df4f starrocks::pipeline::AggregateStreamingSinkOperator::_push_chunk_by_auto()
    @          0x26446cd starrocks::pipeline::AggregateStreamingSinkOperator::push_chunk()
    @          0x2627d5d starrocks::pipeline::PipelineDriver::process()
    @          0x261da11 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x1f5a169 starrocks::ThreadPool::dispatch_thread()
    @          0x1f55d1a starrocks::Thread::supervise_thread()
    @     0x7f9ad089bea5 start_thread
    @     0x7f9acfeb6b0d __clone
    @                0x0 (unknown)
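
When comparing a new crash against the entries on this page, the most useful key is usually the first `starrocks::` frame below the signal handler. The sketch below extracts the frames from a glog-style trace and returns that frame. The helper names `parse_frames` and `first_starrocks_frame` are hypothetical, and the regular expression only assumes the `@ <address> <symbol>` layout shown in the traces here.

```python
import re

# One glog frame line: "    @          0x28f91b5 starrocks::...::update_batch_selectively()"
# The "PC: @ ..." line is skipped on purpose because it does not start with "@".
FRAME_RE = re.compile(r"^\s*@\s+(0x[0-9a-fA-F]+|\(nil\))\s+(.*\S)\s*$")


def parse_frames(trace):
    """Return a list of (address, symbol) tuples in stack order."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def first_starrocks_frame(trace):
    """Return the first symbol in the starrocks namespace, or None."""
    for _, symbol in parse_frames(trace):
        if symbol.startswith("starrocks::"):
            return symbol
    return None


SAMPLE_TRACE = """\
PC: @          0x28f91b5 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
    @          0x3ca37d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f9ad08a3630 (unknown)
    @          0x28f91b5 starrocks::vectorized::NullableAggregateFunctionUnary<>::update_batch_selectively()
    @          0x26c9cd9 starrocks::Aggregator::compute_batch_agg_states_with_selection()
"""
print(first_starrocks_frame(SAMPLE_TRACE))
```

Mangled frames (those starting with `_Z`) are returned as-is; they would need a demangler before they can be compared by name.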

Primary Key table load crash with persistent index enabled

It may also consume a large amount of memory at startup:

W0912 16:42:51.123860 2268472 mem_hook.cpp:254] large memory alloc: 103079215105 bytes, stack:
    @          0x4a360eb  malloc
    @          0x7faf765  operator new()
    @          0x41ed62d  std::__cxx11::basic_string<>::_M_mutate()
    @          0x4458b11  std::__cxx11::basic_string<>::resize()
    @          0x44f3f97  starrocks::SliceMutableIndex::load_snapshot()
    @          0x4448f76  starrocks::ShardByLengthMutableIndex::load_snapshot()
    @          0x4459795  starrocks::ShardByLengthMutableIndex::load()
    @          0x44669dd  starrocks::PersistentIndex::_load()
    @          0x4467a3f  starrocks::PersistentIndex::load()
    @          0x446e6c3  starrocks::PersistentIndex::load_from_tablet()
    @          0x4180f92  starrocks::PrimaryIndex::_do_load()
    @          0x41824bf  starrocks::PrimaryIndex::load()
    @          0x426265e  starrocks::TabletUpdates::_apply_rowset_commit()
    @          0x4266353  starrocks::TabletUpdates::do_apply()
    @          0x4b17465  starrocks::ThreadPool::dispatch_thread()
    @          0x4b11e4a  starrocks::Thread::supervise_thread()
    @     0x7f759126c609  start_thread
    @     0x7f7591030133  clone
    @              (nil)  (unknown)

start time: Tue Nov  8 07:23:17 CST 2022
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1667863398 (unix time) try "date -d @1667863398" if you are using GNU date ***
PC: @     0x7fd3c5943207 __GI_raise
*** SIGABRT (@0x5abc) received by PID 23228 (TID 0x7fd3515fd700) from PID 23228; stack trace: ***
    @          0x481e332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fd3c63f75d0 (unknown)
    @     0x7fd3c5943207 __GI_raise
    @     0x7fd3c59448f8 __GI_abort
    @          0x1c4acef _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a0af6 __cxxabiv1::__terminate()
    @          0x62a0b61 std::terminate()
    @          0x62a0cb4 __cxa_throw
    @          0x1c4abf6 _Znwm.cold
    @          0x22bab2b starrocks::FixedMutableIndex::load_snapshot()
    @          0x229e9e6 starrocks::ShardByLengthMutableIndex::load()
    @          0x22aa9bc starrocks::PersistentIndex::_load()
    @          0x22abe77 starrocks::PersistentIndex::load()
    @          0x22ad821 starrocks::PersistentIndex::load_from_tablet()
    @          0x1ff713c starrocks::PrimaryIndex::_do_load()
    @          0x1ff7edf starrocks::PrimaryIndex::load()
    @          0x1e6df30 starrocks::TabletUpdates::_apply_rowset_commit()
    @          0x1e73bb3 starrocks::TabletUpdates::do_apply()
    @          0x2680af5 starrocks::ThreadPool::dispatch_thread()
    @          0x267bf2a starrocks::Thread::supervise_thread()
    @     0x7fd3c63efdd5 start_thread
    @     0x7fd3c5a0aead clone

Open question: the cherry-pick of this PR to the 2.3 branch failed, and the PR was then closed. Was there a specific reason?

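After a machine spec upgrade, the CPU core count recorded by FE is wrong

In the output of SHOW BACKENDS; CpuCores still shows the original 32 cores, although the machine now has 64 cores.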

          Version: 2.2.8-3edb7b7
           Status: {"lastSuccessReportTabletsTime":"2022-12-11 12:00:45"}
DataTotalCapacity: 4.584 TB
      DataUsedPct: 20.59 %
         CpuCores: 32

Join + bitmap crash

terminate called after throwing an instance of 'std::runtime_error'
  what():  failed memory alloc in constructor
*** Aborted at 1670985104 (unix time) try "date -d @1670985104" if you are using GNU date ***
PC: @     0x7fae92edf387 __GI_raise
*** SIGABRT (@0x3dc0001022a) received by PID 66090 (TID 0x7fae04ae5700) from PID 66090; stack trace: ***
    @          0x3fa3ad2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fae93994630 (unknown)
    @     0x7fae92edf387 __GI_raise
    @     0x7fae92ee0a78 __GI_abort
    @          0x188857d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x59a72a6 __cxxabiv1::__terminate()
    @          0x59a7311 std::terminate()
    @          0x59a74b6 __cxa_rethrow
    @          0x16067b0 _ZNSt8_Rb_treeIjSt4pairIKj7RoaringESt10_Select1stIS3_ESt4lessIjESaIS3_EE7_M_copyINS9_11_Alloc_nodeEEEPSt13_Rb_tree_nodeIS3_EPKSD_PSt18_Rb_tree_node_baseRT_.isra.0.cold
    @          0x209078b starrocks::BitmapValue::BitmapValue()
    @          0x24e859b starrocks::vectorized::ObjectColumn<>::append()
    @          0x24e88b5 starrocks::vectorized::ObjectColumn<>::append_selective()
    @          0x26918f8 starrocks::vectorized::JoinHashMap<>::_copy_build_column()
    @          0x2692099 starrocks::vectorized::JoinHashMap<>::_build_output()
    @          0x26e6289 starrocks::vectorized::JoinHashMap<>::probe()
    @          0x2673478 starrocks::vectorized::JoinHashTable::probe()
    @          0x26646a8 starrocks::vectorized::HashJoinNode::_probe()
    @          0x2665839 starrocks::vectorized::HashJoinNode::get_next()
    @          0x274bd91 starrocks::vectorized::ProjectNode::get_next()
    @          0x2051f23 starrocks::PlanFragmentExecutor::_get_next_internal_vectorized()
    @          0x205375e starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x2054347 starrocks::PlanFragmentExecutor::open()
    @          0x1fd8fab starrocks::FragmentExecState::execute()
    @          0x1fdd5dc starrocks::FragmentMgr::exec_actual()
    @          0x1fdde81 _ZNSt17_Function_handlerIFvvEZN9starrocks11FragmentMgr18exec_plan_fragmentERKNS1_23TExecPlanFragmentParamsERKSt8functionIFvPNS1_20PlanFragmentExecutorEEESC_EUlvE_E9_M_invokeERKSt9_Any_data
    @          0x2132549 starrocks::ThreadPool::dispatch_thread()
    @          0x212e0fa starrocks::Thread::supervise_thread()
    @     0x7fae9398cea5 start_thread
    @     0x7fae92fa796d __clone
    @                0x0 (unknown)

Parquet read crash

*** Aborted at 1670504382 (unix time) try "date -d @1670504382" if you are using GNU date ***
PC: @          0xcd14fa9 starrocks::parquet::DictDecoder<>::get_dict_values()
*** SIGSEGV (@0x604313bec270) received by PID 155105 (TID 0x7f7e6ae2b700) from PID 331268720; stack trace: ***
    @          0xdb15d42 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f7f4792f3ab os::Linux::chained_handler()
    @     0x7f7f47933efc JVM_handle_linux_signal
    @     0x7f7f47926d48 signalHandler()
    @     0x7f7f46e04630 (unknown)
    @          0xcd14fa9 starrocks::parquet::DictDecoder<>::get_dict_values()
    @          0xccfe8b3 starrocks::parquet::ColumnChunkReader::get_dict_values()
    @          0xccfee2c starrocks::parquet::StoredColumnReader::get_dict_values()
    @          0xccf0520 starrocks::parquet::ScalarColumnReader::get_dict_values()
    @          0xcce43d0 starrocks::parquet::GroupReader::_dict_decode()
    @          0xccdd403 starrocks::parquet::GroupReader::get_next()
    @          0xcca4e6a starrocks::parquet::FileReader::get_next()
    @          0xc96db97 starrocks::vectorized::HdfsParquetScanner::do_get_next()
    @          0xc9414b9 starrocks::vectorized::HdfsScanner::get_next()
    @          0xc825ce6 starrocks::connector::HiveDataSource::get_next()
    @          0x749bd98 starrocks::pipeline::ConnectorChunkSource::_read_chunk()
    @          0x7449923 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
    @          0x63a7ae8 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
    @          0x63ac2fe _ZSt13__invoke_implIvRZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS0_12RuntimeStateEiEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @          0x63ac1ac _ZSt10__invoke_rIvRZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS0_12RuntimeStateEiEUlvE_JEENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES8_E4typeEOS9_DpOSA_
    @          0x63ac021 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE_E9_M_invokeERKSt9_Any_data
    @          0x6022428 std::function<>::operator()()
    @          0x63bf8be starrocks::workgroup::ScanExecutor::worker_thread()
    @          0x63bf108 _ZZN9starrocks9workgroup12ScanExecutor10initializeEiENKUlvE_clEv
    @          0x63c0ada _ZSt13__invoke_implIvRZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @          0x63c07a2 _ZSt10__invoke_rIvRZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_JEENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES6_E4typeEOS7_DpOS8_
    @          0x63c0317 _ZNSt17_Function_handlerIFvvEZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_E9_M_invokeERKSt9_Any_data
    @          0x6022428 std::function<>::operator()()
    @          0xbbe3054 starrocks::FunctionRunnable::run()
    @          0xbbdfe94 starrocks::ThreadPool::dispatch_thread()
    @          0xbbfd420 std::__invoke_impl<>()
    @          0xbbfcd79 std::__invoke<>()
  • Github Issue: none
  • Github Fix PR: https://github.com/StarRocks/starrocks/pull/15189
  • Affected versions:
    • 2.3.0 ~ 2.3.5
    • 2.4.0 ~ 2.4.2
  • Fixed versions:
    • 2.3.6+
    • 2.4.3+
  • Workaround:
  • Root cause:
    • A single scan range spans multiple row groups, where one row group is dictionary-encoded and another is not.
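
To check quickly whether a running BE falls inside the affected ranges listed for this Parquet crash, a small comparison helper is enough. This is a sketch only: `parse_version` and `is_affected` are hypothetical names, the ranges are copied from the list above and treated as inclusive, and a build suffix such as the `-3edb7b7` seen in SHOW BACKENDS output is assumed to be safe to ignore.

```python
def parse_version(text):
    """Turn "2.3.5" or "2.2.8-3edb7b7" into a comparable tuple such as (2, 3, 5)."""
    numeric_part = text.strip().split("-")[0]  # drop a commit-hash suffix if present
    return tuple(int(part) for part in numeric_part.split("."))


# Affected ranges for the Parquet read crash, inclusive on both ends.
PARQUET_CRASH_RANGES = [("2.3.0", "2.3.5"), ("2.4.0", "2.4.2")]


def is_affected(version, ranges=PARQUET_CRASH_RANGES):
    """Return True if `version` lies inside any of the (low, high) ranges."""
    current = parse_version(version)
    return any(parse_version(low) <= current <= parse_version(high)
               for low, high in ranges)


for candidate in ("2.3.5", "2.3.6", "2.4.1", "2.2.8-3edb7b7"):
    print(candidate, is_affected(candidate))
```

The same helper can be reused for other entries on this page by passing a different `ranges` list.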

TopN crash

query_id:09e1a166-803f-11ed-b2ae-c4b8b44f4875, fragment_instance:09e1a166-803f-11ed-b2ae-c4b8b44f4875
*** Aborted at 1671524376 (unix time) try "date -d @1671524376" if you are using GNU date ***
PC: @     0x2af0ba23c676 __memcmp_sse4_1
*** SIGSEGV (@0xe4d00000019) received by PID 20674 (TID 0x2af0f4fcd700) from PID 25; stack trace: ***
    @          0x3ff4972 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2af0b97b6630 (unknown)
    @     0x2af0ba23c676 __memcmp_sse4_1
    @          0x27e54d8 _ZSt13__adjust_heapIN9__gnu_cxx17__normal_iteratorIPN9starrocks10vectorized20VerticalColumnSorter16CompactChunkItemINS2_5SliceEEESt6vectorIS7_SaIS7_EEEElS7_NS0_5__ops15_Iter_comp_iterIZNS3_L19sort_and_tie_helperIZNS4_8do_visitIjEENS2_6StatusERKNS3_16BinaryColumnBaseIT_EEEUlRKS7_SO_E_SB_EESH_RKbPKNS3_6ColumnEbRT0_RS9_IhSaIhEESJ_St4pairIiiEbmPmEUlSJ_SV_E_EEEvSJ_SV_SV_T1_T2_
    @          0x2884c07 starrocks::vectorized::VerticalColumnSorter::do_visit<>()
    @          0x2885a76 starrocks::ColumnVisitorAdapter<>::visit()
    @          0x19d1edc starrocks::vectorized::ColumnFactory<>::accept()
    @          0x27dcc5b starrocks::vectorized::sort_vertical_columns()
    @          0x2859c4b starrocks::vectorized::VerticalColumnSorter::do_visit()
    @          0x2859e26 starrocks::ColumnVisitorAdapter<>::visit()
    @          0x252298c starrocks::vectorized::ColumnFactory<>::accept()
    @          0x27dcc5b starrocks::vectorized::sort_vertical_columns()
    @          0x27de21a starrocks::vectorized::sort_vertical_chunks()
    @          0x27529e5 starrocks::vectorized::ChunksSorterTopn::_partial_sort_col_wise()
    @          0x2752e9c starrocks::vectorized::ChunksSorterTopn::_filter_and_sort_data()
    @          0x2756434 starrocks::vectorized::ChunksSorterTopn::_sort_chunks()
    @          0x2756b10 starrocks::vectorized::ChunksSorterTopn::done()
    @          0x27419e5 starrocks::vectorized::ChunksSorter::finish()
    @          0x28ba860 starrocks::pipeline::PartitionSortSinkOperator::set_finishing()
    @          0x28def07 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x28dff3b starrocks::pipeline::PipelineDriver::process()
    @          0x28d67dc starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x21772c9 starrocks::ThreadPool::dispatch_thread()
    @          0x2172e7a starrocks::Thread::supervise_thread()
    @     0x2af0b97aeea5 start_thread
    @     0x2af0ba1cfb0d __clone
    @                0x0 (unknown)

CrossJoin crash

*** SIGABRT (@0x9219) received by PID 37401 (TID 0x2aaeee9bf700) from PID 37401; stack trace: ***
    @          0x3d0a5d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2aaeb4ebe630 (unknown)
    @     0x2aaeb580f387 __GI_raise
    @     0x2aaeb5810a78 __GI_abort
    @          0x17f69ed _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x56c8086 __cxxabiv1::__terminate()
    @          0x56c80f1 std::terminate()
    @          0x56c8244 __cxa_throw
    @          0x17f85d1 std::__throw_length_error()
    @          0x196c682 std::vector<>::_M_range_insert<>()
    @          0x1966ce4 starrocks::vectorized::BinaryColumn::append()
    @          0x232fd96 starrocks::vectorized::NullableColumn::append()
    @          0x2677b80 starrocks::pipeline::CrossJoinLeftOperator::_copy_joined_rows_with_index_base_probe()
    @          0x2678041 starrocks::pipeline::CrossJoinLeftOperator::pull_chunk()
    @          0x268f05a starrocks::pipeline::PipelineDriver::process()
    @          0x268525e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x1fb37d9 starrocks::ThreadPool::dispatch_thread()
    @          0x1faf38a starrocks::Thread::supervise_thread()
    @     0x2aaeb4eb6ea5 start_thread
    @     0x2aaeb58d7b0d __clone
    @                0x0 (unknown)
  • Github Issue: none
  • Github Fix PR: https://github.com/StarRocks/starrocks/pull/15497
  • Affected versions:
    • 2.3.0 ~ 2.3.6
    • 2.4.0 ~ 2.4.2
  • Fixed versions:
    • 2.3.7+
    • 2.4.3+
  • Workaround:
  • Root cause:
    • Inaccurate cardinality estimation leads to an unreasonable execution plan.

BThread crash

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1672016623 (unix time) try "date -d @1672016623" if you are using GNU date ***
PC: @     0x7fa77afe1387 __GI_raise
*** SIGABRT (@0x3eb00002b0f) received by PID 11023 (TID 0x7fa5b3873700) from PID 11023; stack trace: ***
    @          0x403cce2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa77ba96630 (unknown)
    @     0x7fa77afe1387 __GI_raise
    @     0x7fa77afe2a78 __GI_abort
    @          0x18cd85d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x5a40906 __cxxabiv1::__terminate()
    @          0x5ae49d9 __cxa_call_terminate
    @          0x5a40321 __gxx_personality_v0
    @          0x5aeb62e _Unwind_RaiseException_Phase2
    @          0x5aec126 _Unwind_Resume
    @          0x17e1422 _ZN4brpc6policy17ProcessRpcRequestEPNS_16InputMessageBaseE.cold
    @          0x41654e7 brpc::ProcessInputMessage()
    @          0x4166393 brpc::InputMessenger::OnNewMessages()
    @          0x420d05e brpc::Socket::ProcessEvent()
    @          0x411afef bthread::TaskGroup::task_runner()
    @          0x42a37d1 bthread_make_fcontext

Or:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1672390713 (unix time) try "date -d @1672390713" if you are using GNU date ***
PC: @     0x7fac0aea3387 __GI_raise
*** SIGABRT (@0x3f000004a75) received by PID 19061 (TID 0x7fab128b6700) from PID 19061; stack trace: ***
    @          0x4875742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fac0b958630 (unknown)
    @     0x7fac0aea3387 __GI_raise
    @     0x7fac0aea4a78 __GI_abort
    @          0x1c4f8af _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62f82f6 __cxxabiv1::__terminate()
    @          0x62f8361 std::terminate()
    @          0x62f84b4 __cxa_throw
    @          0x1c4f7b6 _Znwm.cold
    @          0x4868341 google::LogMessage::Init()
    @          0x4868a01 google::LogMessage::LogMessage()
    @          0x499ec02 brpc::InputMessenger::OnNewMessages()
    @          0x4a4595e brpc::Socket::ProcessEvent()
    @          0x4953a0f bthread::TaskGroup::task_runner()
    @          0x4adc0d1 bthread_make_fcontext

Wrong result for LIMIT after GROUP BY: the number of returned rows fluctuates between runs

Primary Key table compaction crash

start time: Tue Dec 27 17:12:33 CST 2022
*** Aborted at 1672132354 (unix time) try "date -d @1672132354" if you are using GNU date ***
PC: @          0x17eb7f7 starrocks::vectorized::ChunkHelper::column_from_field()
*** SIGSEGV (@0x0) received by PID 34830 (TID 0x7f181273b700) from PID 0; stack trace: ***
    @          0x34fe482 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f18767605e0 (unknown)
    @          0x17eb7f7 starrocks::vectorized::ChunkHelper::column_from_field()
    @          0x17ebdda starrocks::vectorized::ChunkHelper::new_chunk()
    @          0x18ff0ae starrocks::vectorized::RowsetMergerImpl<>::_do_merge_horizontally()
    @          0x19021b2 starrocks::vectorized::RowsetMergerImpl<>::do_merge()
    @          0x18ef267 starrocks::vectorized::compaction_merge_rowsets()
    @          0x17d88e8 starrocks::TabletUpdates::_do_compaction()
    @          0x17d9999 starrocks::TabletUpdates::compaction()
    @          0x176314c starrocks::StorageEngine::_perform_update_compaction()
    @          0x1757e9f starrocks::StorageEngine::_update_compaction_thread_callback()
    @          0x4fdb870 execute_native_thread_routine
    @     0x7f1876758e25 start_thread
    @     0x7f1875b6234d __clone
    @                0x0 (unknown)

Version already been compacted

 version already been compacted

get_json_int or get_json_double crash

terminate called after throwing an instance of 'arangodb::velocypack::Exception'
  what():  Expecting numeric type
query_id:3c5935cf-8299-11ed-8c5b-06d48912a230, fragment_instance:3c5935cf-8299-11ed-8c5b-06d48912a231
*** Aborted at 1671783017 (unix time) try "date -d @1671783017" if you are using GNU date ***
PC: @     0x7fddd4449ca0 __GI_raise
*** SIGABRT (@0xbb2) received by PID 2994 (TID 0x7fdd59562700) from PID 2994; stack trace: ***
    @          0x481e332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fddd4f208e0 (unknown)
    @     0x7fddd4449ca0 __GI_raise
    @     0x7fddd444b148 __GI_abort
    @          0x1c4acef _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a0af6 __cxxabiv1::__terminate()
    @          0x62a0b61 std::terminate()
    @          0x62a0cb4 __cxa_throw
    @          0x3a146e2 starrocks::vectorized::JsonFunctions::_json_query_impl<>()
    @          0x3a0f152 starrocks::vectorized::JsonFunctions::get_native_json_double()
    @          0x39af998 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x3454e2c starrocks::ExprContext::evaluate()
    @          0x2e26eb2 starrocks::pipeline::ProjectOperator::push_chunk()
    @          0x2e7868c starrocks::pipeline::PipelineDriver::process()
    @          0x2e6e5a3 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2680a05 starrocks::ThreadPool::dispatch_thread()
    @          0x267bf2a starrocks::Thread::supervise_thread()
    @     0x7fddd4f1644b start_thread
    @     0x7fddd450556f __GI___clone
    @                0x0 (unknown)

Crash when querying Hive through a Hive catalog

PC: @          0x3510d8a starrocks::connector::HiveDataSource::_init_scanner()
*** SIGSEGV (@0x10) received by PID 407690 (TID 0x7f0d01763700) from PID 16; stack trace: ***
    @          0x403cce2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f0d52195852 os::Linux::chained_handler()
    @     0x7f0d5219c676 JVM_handle_linux_signal
    @     0x7f0d52192653 signalHandler()
    @     0x7f39493f05d0 (unknown)
    @          0x3510d8a starrocks::vectorized::HdfsScanner::_build_scanner_context()
    @          0x35117ef starrocks::vectorized::HdfsScanner::open()
    @          0x34822eb starrocks::connector::HiveDataSource::_init_scanner()
    @          0x3484a33 starrocks::connector::HiveDataSource::open()
    @          0x28b0fdc starrocks::pipeline::ConnectorChunkSource::_read_chunk()
    @          0x28b10c3 starrocks::pipeline::ConnectorChunkSource::buffer_next_batch_chunks_blocking()
    @          0x28ac22c _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
    @          0x2011c60 starrocks::PriorityThreadPool::work_thread()
    @          0x3f92fa7 thread_proxy
    @     0x7f0d51653dd5 start_thread
    @     0x7f0d50c6eead __clone
    @                0x0 (unknown)

Not found dict for cid

The query fails with:

Not found dict for cid
  • Github Issue: none
  • Github Fix PR: https://github.com/StarRocks/starrocks/pull/13185
  • Jira:
  • Affected versions:
    • 2.2.0 ~ 2.2.9
    • 2.3.0 ~ 2.3.4
    • 2.4.0 ~ 2.4.1
  • Fixed versions:
    • 2.2.10+
    • 2.3.5+
    • 2.4.2+
  • Workaround:
    • set global cbo_enable_low_cardinality_optimize=false;
  • Root cause:
    • See the PR description.

Queries hang when using resource groups

pstack shows the following stack:

pstack <PID of starrocks_be>

#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000005a229bc in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x37cf64bf8) at /var/local/gcc/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::condition_variable::wait (this=this@entry=0x37cf64bf8, __lock=...) at ../../../.././libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
#5  0x00000000028f0305 in starrocks::pipeline::GlobalDriverExecutor::_worker_thread (this=0xa892ee0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_executor.cpp:86
#6  0x000000000217fef9 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:248
#7  starrocks::FunctionRunnable::run (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:44
#8  starrocks::ThreadPool::dispatch_thread (this=0x19a50000) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:513
#9  0x000000000217baaa in std::function<void ()>::operator()() const (this=0x17fa08d8) at /usr/include/c++/10.3.0/bits/std_function.h:248
#10 starrocks::Thread::supervise_thread (arg=0x17fa08c0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/thread.cpp:326
#11 0x00007fe5bc2c8ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fe5bb8e3b0d in clone () from /lib64/libc.so.6

The same thread shows two take frames at once:

#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
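
To see how many worker threads are parked in this state, the pstack output can be saved to a file and scanned per thread. The sketch below counts threads whose stack contains a given frame. `threads_containing` is a hypothetical helper, and the sample text is a shortened, made-up pstack dump that only mirrors the frame names shown above.

```python
def threads_containing(pstack_text, needle):
    """Count the threads whose stack has at least one frame containing `needle`."""
    count = 0
    counted_current = False
    for line in pstack_text.splitlines():
        if line.startswith("Thread "):
            # pstack starts each thread section with "Thread N (Thread 0x... (LWP n)):"
            counted_current = False
        elif needle in line and not counted_current:
            counted_current = True
            count += 1
    return count


# Shortened, hypothetical pstack output: two workers parked in take(), one elsewhere.
SAMPLE_PSTACK = """\
Thread 3 (Thread 0x7fe5a0001700 (LWP 101)):
#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400)
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>)
Thread 2 (Thread 0x7fe5a0002700 (LWP 102)):
#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400)
Thread 1 (Thread 0x7fe5a0003700 (LWP 103)):
#0  0x00007fe5bb8e3b0d in clone () from /lib64/libc.so.6
"""
print(threads_containing(SAMPLE_PSTACK, "QuerySharedDriverQueue::take"))
```

In practice the input would come from something like `pstack <PID> > be.pstack`, read back with `open("be.pstack").read()`.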

StatisticsCollectJob on table _statistics_.column_statistics fails with "Too many versions"

2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateFragmentExecStatus():2174] one instance report fail errorCode SERVICE_UNAVAILABLE Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26, query_id=3772d178-8ca4-11ed-854d-6cfe54388271 instance_id=3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateStatus():1249] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: 3772d178-8ca4-11ed-854d-6cfe54388271, instance id: 3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1338] insert failed: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1415] handle insert stmt fail: insert_3772d178-8ca4-11ed-854d-6cfe54388271
com.starrocks.common.DdlException: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
        at com.starrocks.common.ErrorReport.reportDdlException(ErrorReport.java:80) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.handleDMLStmt(StmtExecutor.java:1339) [starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:471) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticsCollectJob.collectStatisticSync(StatisticsCollectJob.java:92) [starrocks-fe.jar:?]
        at com.starrocks.statistic.FullStatisticsCollectJob.collect(FullStatisticsCollectJob.java:62) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticExecutor.collectStatistics(StatisticExecutor.java:190) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticAutoCollector.runAfterCatalogReady(StatisticAutoCollector.java:61) [starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
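
When this error repeats, it helps to know which tablets are hitting the limit. The sketch below scans FE log text for the "Too many versions" message and keeps the highest version count seen per tablet. The function name `too_many_versions` is hypothetical, and the pattern is based only on the message format in the log lines above.

```python
import re

# Message format taken from the FE log lines above.
TOO_MANY_RE = re.compile(
    r"Too many versions\. tablet_id: (\d+), version_count: (\d+), limit: (\d+)"
)


def too_many_versions(log_text):
    """Return {tablet_id: (highest_version_count, limit)} for every tablet reported."""
    result = {}
    for match in TOO_MANY_RE.finditer(log_text):
        tablet_id, count, limit = (int(group) for group in match.groups())
        previous = result.get(tablet_id)
        if previous is None or count > previous[0]:
            result[tablet_id] = (count, limit)
    return result


SAMPLE_LOG = (
    "2023-01-05 10:54:05,174 WARN (AutoStatistic|38) "
    "[StmtExecutor.handleDMLStmt():1338] insert failed: Too many versions. "
    "tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26\n"
)
print(too_many_versions(SAMPLE_LOG))
```

The resulting tablet IDs can then be looked up to confirm that they belong to the statistics table.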

INSERT memory leak (INSERT or INSERT INTO SELECT)

The FE Follower leaks memory while the Leader is normal. The memory distribution shows that TxnStateCallbackFactory holds a large amount of memory.

jmap -histo pid


 num     #instances         #bytes  class name
----------------------------------------------
   1:      65039949     6979006048  [C
   2:       4022619     2525925768  [B
   3:      51632292     2478350016  java.util.HashMap
   4:      73877356     1773056544  java.lang.String
   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob
   6:      20355243      977051664  com.google.gson.internal.LinkedTreeMap$Node
   7:      20352822      976935456  com.google.gson.internal.LinkedTreeMap
   8:      10727986      935362088  [Ljava.util.HashMap$Node;
   9:      38312936      919510464  java.lang.Long
  10:      10172713      813817488  [Lorg.apache.commons.collections.map.AbstractHashedMap$HashEntry;
  11:      22247256      711912192  java.util.HashMap$Node
  12:      10230960      654781440  java.util.concurrent.ConcurrentHashMap
  13:      10461293      585832408  java.util.LinkedHashMap
  14:      10172712      569671872  com.starrocks.load.EtlStatus
  15:      10172712      569671872  org.apache.commons.collections.map.HashedMap
  16:      10172400      488275200  java.util.concurrent.locks.ReentrantReadWriteLock$FairSync

If com.starrocks.load.loadv2.InsertLoadJob accounts for a large share of the histogram, as it does here, the FE is hitting this issue.
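
The check can be scripted against saved `jmap -histo` output. The following sketch parses the histogram rows and looks up the instance count and byte total for one class. `parse_histo` and `class_stats` are hypothetical helper names, and the sample holds a few rows copied from the histogram above. What counts as "a large share" is left to the reader; no threshold is implied.

```python
import re

# One histogram row: "   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob"
HISTO_RE = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Return a list of (class_name, instances, bytes) from jmap -histo output."""
    rows = []
    for line in text.splitlines():
        match = HISTO_RE.match(line)
        if match:
            rows.append((match.group(3), int(match.group(1)), int(match.group(2))))
    return rows


def class_stats(text, class_name):
    """Return (instances, bytes) for `class_name`, or None if it is not listed."""
    for name, instances, nbytes in parse_histo(text):
        if name == class_name:
            return instances, nbytes
    return None


SAMPLE_HISTO = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:      65039949     6979006048  [C
   3:      51632292     2478350016  java.util.HashMap
   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob
"""
print(class_stats(SAMPLE_HISTO, "com.starrocks.load.loadv2.InsertLoadJob"))
```

Comparing the instance count between two dumps taken some time apart on the Follower shows whether the number of `InsertLoadJob` objects keeps growing.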