Common Crash / Bug / Optimization Lookup

FE records the wrong CpuCores value after a machine is upgraded

SHOW BACKENDS; still reports CpuCores as the original 32 cores, although the machine now has 64 cores.

          Version: 2.2.8-3edb7b7
           Status: {"lastSuccessReportTabletsTime":"2022-12-11 12:00:45"}
DataTotalCapacity: 4.584 TB
      DataUsedPct: 20.59 %
         CpuCores: 32
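
A quick way to confirm the mismatch is to compare the recorded value with the real core count. The sketch below is illustrative only: `parse_vertical_row` and `cpu_cores_mismatch` are hypothetical helper names, and the sample is an abbreviated copy of the output above.

```python
def parse_vertical_row(text):
    """Parse vertical-format 'Key: value' lines into a dict."""
    row = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            row[key.strip()] = value.strip()
    return row


def cpu_cores_mismatch(show_backends_row, actual_cores):
    """Return True when the recorded CpuCores differs from the real core count."""
    recorded = int(parse_vertical_row(show_backends_row)["CpuCores"])
    return recorded != actual_cores


# Abbreviated sample taken from the SHOW BACKENDS output above.
sample = """
          Version: 2.2.8-3edb7b7
DataTotalCapacity: 4.584 TB
         CpuCores: 32
"""
print(cpu_cores_mismatch(sample, actual_cores=64))
```

On the BE host itself, `os.cpu_count()` could supply `actual_cores`.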

Join + bitmap crash

terminate called after throwing an instance of 'std::runtime_error'
  what():  failed memory alloc in constructor
*** Aborted at 1670985104 (unix time) try "date -d @1670985104" if you are using GNU date ***
PC: @     0x7fae92edf387 __GI_raise
*** SIGABRT (@0x3dc0001022a) received by PID 66090 (TID 0x7fae04ae5700) from PID 66090; stack trace: ***
    @          0x3fa3ad2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fae93994630 (unknown)
    @     0x7fae92edf387 __GI_raise
    @     0x7fae92ee0a78 __GI_abort
    @          0x188857d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x59a72a6 __cxxabiv1::__terminate()
    @          0x59a7311 std::terminate()
    @          0x59a74b6 __cxa_rethrow
    @          0x16067b0 _ZNSt8_Rb_treeIjSt4pairIKj7RoaringESt10_Select1stIS3_ESt4lessIjESaIS3_EE7_M_copyINS9_11_Alloc_nodeEEEPSt13_Rb_tree_nodeIS3_EPKSD_PSt18_Rb_tree_node_baseRT_.isra.0.cold
    @          0x209078b starrocks::BitmapValue::BitmapValue()
    @          0x24e859b starrocks::vectorized::ObjectColumn<>::append()
    @          0x24e88b5 starrocks::vectorized::ObjectColumn<>::append_selective()
    @          0x26918f8 starrocks::vectorized::JoinHashMap<>::_copy_build_column()
    @          0x2692099 starrocks::vectorized::JoinHashMap<>::_build_output()
    @          0x26e6289 starrocks::vectorized::JoinHashMap<>::probe()
    @          0x2673478 starrocks::vectorized::JoinHashTable::probe()
    @          0x26646a8 starrocks::vectorized::HashJoinNode::_probe()
    @          0x2665839 starrocks::vectorized::HashJoinNode::get_next()
    @          0x274bd91 starrocks::vectorized::ProjectNode::get_next()
    @          0x2051f23 starrocks::PlanFragmentExecutor::_get_next_internal_vectorized()
    @          0x205375e starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x2054347 starrocks::PlanFragmentExecutor::open()
    @          0x1fd8fab starrocks::FragmentExecState::execute()
    @          0x1fdd5dc starrocks::FragmentMgr::exec_actual()
    @          0x1fdde81 _ZNSt17_Function_handlerIFvvEZN9starrocks11FragmentMgr18exec_plan_fragmentERKNS1_23TExecPlanFragmentParamsERKSt8functionIFvPNS1_20PlanFragmentExecutorEEESC_EUlvE_E9_M_invokeERKSt9_Any_data
    @          0x2132549 starrocks::ThreadPool::dispatch_thread()
    @          0x212e0fa starrocks::Thread::supervise_thread()
    @     0x7fae9398cea5 start_thread
    @     0x7fae92fa796d __clone
    @                0x0 (unknown)

Parquet read crash

*** Aborted at 1670504382 (unix time) try "date -d @1670504382" if you are using GNU date ***
PC: @          0xcd14fa9 starrocks::parquet::DictDecoder<>::get_dict_values()
*** SIGSEGV (@0x604313bec270) received by PID 155105 (TID 0x7f7e6ae2b700) from PID 331268720; stack trace: ***
    @          0xdb15d42 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f7f4792f3ab os::Linux::chained_handler()
    @     0x7f7f47933efc JVM_handle_linux_signal
    @     0x7f7f47926d48 signalHandler()
    @     0x7f7f46e04630 (unknown)
    @          0xcd14fa9 starrocks::parquet::DictDecoder<>::get_dict_values()
    @          0xccfe8b3 starrocks::parquet::ColumnChunkReader::get_dict_values()
    @          0xccfee2c starrocks::parquet::StoredColumnReader::get_dict_values()
    @          0xccf0520 starrocks::parquet::ScalarColumnReader::get_dict_values()
    @          0xcce43d0 starrocks::parquet::GroupReader::_dict_decode()
    @          0xccdd403 starrocks::parquet::GroupReader::get_next()
    @          0xcca4e6a starrocks::parquet::FileReader::get_next()
    @          0xc96db97 starrocks::vectorized::HdfsParquetScanner::do_get_next()
    @          0xc9414b9 starrocks::vectorized::HdfsScanner::get_next()
    @          0xc825ce6 starrocks::connector::HiveDataSource::get_next()
    @          0x749bd98 starrocks::pipeline::ConnectorChunkSource::_read_chunk()
    @          0x7449923 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
    @          0x63a7ae8 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
    @          0x63ac2fe _ZSt13__invoke_implIvRZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS0_12RuntimeStateEiEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @          0x63ac1ac _ZSt10__invoke_rIvRZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS0_12RuntimeStateEiEUlvE_JEENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES8_E4typeEOS9_DpOSA_
    @          0x63ac021 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE_E9_M_invokeERKSt9_Any_data
    @          0x6022428 std::function<>::operator()()
    @          0x63bf8be starrocks::workgroup::ScanExecutor::worker_thread()
    @          0x63bf108 _ZZN9starrocks9workgroup12ScanExecutor10initializeEiENKUlvE_clEv
    @          0x63c0ada _ZSt13__invoke_implIvRZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @          0x63c07a2 _ZSt10__invoke_rIvRZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_JEENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES6_E4typeEOS7_DpOS8_
    @          0x63c0317 _ZNSt17_Function_handlerIFvvEZN9starrocks9workgroup12ScanExecutor10initializeEiEUlvE_E9_M_invokeERKSt9_Any_data
    @          0x6022428 std::function<>::operator()()
    @          0xbbe3054 starrocks::FunctionRunnable::run()
    @          0xbbdfe94 starrocks::ThreadPool::dispatch_thread()
    @          0xbbfd420 std::__invoke_impl<>()
    @          0xbbfcd79 std::__invoke<>()
  • GitHub Issue: none
  • GitHub Fix PR: https://github.com/StarRocks/starrocks/pull/15189
  • Affected versions:
    • 2.3.0 ~ 2.3.5
    • 2.4.0 ~ 2.4.2
  • Fixed versions:
    • 2.3.6+
    • 2.4.3+
  • Workaround:
  • Root cause:
    • A single scan range spans multiple row groups, where one row group is dictionary-encoded and another is not.
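
To check whether a deployed build falls inside an affected range such as the one listed above, a small helper can compare version tuples. This is a minimal sketch: `AFFECTED`, `parse_version` and `is_affected` are hypothetical names, and it assumes version strings look like `2.3.4` or `2.2.8-3edb7b7`.

```python
import re

# Affected ranges copied from the entry above (inclusive on both ends).
AFFECTED = [("2.3.0", "2.3.5"), ("2.4.0", "2.4.2")]


def parse_version(text):
    """Turn '2.3.4' or '2.2.8-3edb7b7' into a comparable tuple such as (2, 3, 4)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version, ranges=AFFECTED):
    """Return True if version falls inside any inclusive (low, high) range."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi) for lo, hi in ranges)


print(is_affected("2.3.5"), is_affected("2.3.6"))
```

The same helper works for the other entries on this page by passing their ranges as the `ranges` argument.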

TopN crash

query_id:09e1a166-803f-11ed-b2ae-c4b8b44f4875, fragment_instance:09e1a166-803f-11ed-b2ae-c4b8b44f4875
*** Aborted at 1671524376 (unix time) try "date -d @1671524376" if you are using GNU date ***
PC: @     0x2af0ba23c676 __memcmp_sse4_1
*** SIGSEGV (@0xe4d00000019) received by PID 20674 (TID 0x2af0f4fcd700) from PID 25; stack trace: ***
    @          0x3ff4972 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2af0b97b6630 (unknown)
    @     0x2af0ba23c676 __memcmp_sse4_1
    @          0x27e54d8 _ZSt13__adjust_heapIN9__gnu_cxx17__normal_iteratorIPN9starrocks10vectorized20VerticalColumnSorter16CompactChunkItemINS2_5SliceEEESt6vectorIS7_SaIS7_EEEElS7_NS0_5__ops15_Iter_comp_iterIZNS3_L19sort_and_tie_helperIZNS4_8do_visitIjEENS2_6StatusERKNS3_16BinaryColumnBaseIT_EEEUlRKS7_SO_E_SB_EESH_RKbPKNS3_6ColumnEbRT0_RS9_IhSaIhEESJ_St4pairIiiEbmPmEUlSJ_SV_E_EEEvSJ_SV_SV_T1_T2_
    @          0x2884c07 starrocks::vectorized::VerticalColumnSorter::do_visit<>()
    @          0x2885a76 starrocks::ColumnVisitorAdapter<>::visit()
    @          0x19d1edc starrocks::vectorized::ColumnFactory<>::accept()
    @          0x27dcc5b starrocks::vectorized::sort_vertical_columns()
    @          0x2859c4b starrocks::vectorized::VerticalColumnSorter::do_visit()
    @          0x2859e26 starrocks::ColumnVisitorAdapter<>::visit()
    @          0x252298c starrocks::vectorized::ColumnFactory<>::accept()
    @          0x27dcc5b starrocks::vectorized::sort_vertical_columns()
    @          0x27de21a starrocks::vectorized::sort_vertical_chunks()
    @          0x27529e5 starrocks::vectorized::ChunksSorterTopn::_partial_sort_col_wise()
    @          0x2752e9c starrocks::vectorized::ChunksSorterTopn::_filter_and_sort_data()
    @          0x2756434 starrocks::vectorized::ChunksSorterTopn::_sort_chunks()
    @          0x2756b10 starrocks::vectorized::ChunksSorterTopn::done()
    @          0x27419e5 starrocks::vectorized::ChunksSorter::finish()
    @          0x28ba860 starrocks::pipeline::PartitionSortSinkOperator::set_finishing()
    @          0x28def07 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x28dff3b starrocks::pipeline::PipelineDriver::process()
    @          0x28d67dc starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x21772c9 starrocks::ThreadPool::dispatch_thread()
    @          0x2172e7a starrocks::Thread::supervise_thread()
    @     0x2af0b97aeea5 start_thread
    @     0x2af0ba1cfb0d __clone
    @                0x0 (unknown)

CrossJoin crash

*** SIGABRT (@0x9219) received by PID 37401 (TID 0x2aaeee9bf700) from PID 37401; stack trace: ***
    @          0x3d0a5d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2aaeb4ebe630 (unknown)
    @     0x2aaeb580f387 __GI_raise
    @     0x2aaeb5810a78 __GI_abort
    @          0x17f69ed _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x56c8086 __cxxabiv1::__terminate()
    @          0x56c80f1 std::terminate()
    @          0x56c8244 __cxa_throw
    @          0x17f85d1 std::__throw_length_error()
    @          0x196c682 std::vector<>::_M_range_insert<>()
    @          0x1966ce4 starrocks::vectorized::BinaryColumn::append()
    @          0x232fd96 starrocks::vectorized::NullableColumn::append()
    @          0x2677b80 starrocks::pipeline::CrossJoinLeftOperator::_copy_joined_rows_with_index_base_probe()
    @          0x2678041 starrocks::pipeline::CrossJoinLeftOperator::pull_chunk()
    @          0x268f05a starrocks::pipeline::PipelineDriver::process()
    @          0x268525e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x1fb37d9 starrocks::ThreadPool::dispatch_thread()
    @          0x1faf38a starrocks::Thread::supervise_thread()
    @     0x2aaeb4eb6ea5 start_thread
    @     0x2aaeb58d7b0d __clone
    @                0x0 (unknown)
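  • GitHub Issue: none
  • GitHub Fix PR: https://github.com/StarRocks/starrocks/pull/15497
  • Affected versions:
    • 2.3.0 ~ 2.3.6
    • 2.4.0 ~ 2.4.2
  • Fixed versions:
    • 2.3.7+
    • 2.4.3+
  • Workaround:
  • Root cause:
    • Cardinality estimation is inaccurate, which leads to an unreasonable execution plan.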

BThread crash

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1672016623 (unix time) try "date -d @1672016623" if you are using GNU date ***
PC: @     0x7fa77afe1387 __GI_raise
*** SIGABRT (@0x3eb00002b0f) received by PID 11023 (TID 0x7fa5b3873700) from PID 11023; stack trace: ***
    @          0x403cce2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa77ba96630 (unknown)
    @     0x7fa77afe1387 __GI_raise
    @     0x7fa77afe2a78 __GI_abort
    @          0x18cd85d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x5a40906 __cxxabiv1::__terminate()
    @          0x5ae49d9 __cxa_call_terminate
    @          0x5a40321 __gxx_personality_v0
    @          0x5aeb62e _Unwind_RaiseException_Phase2
    @          0x5aec126 _Unwind_Resume
    @          0x17e1422 _ZN4brpc6policy17ProcessRpcRequestEPNS_16InputMessageBaseE.cold
    @          0x41654e7 brpc::ProcessInputMessage()
    @          0x4166393 brpc::InputMessenger::OnNewMessages()
    @          0x420d05e brpc::Socket::ProcessEvent()
    @          0x411afef bthread::TaskGroup::task_runner()
    @          0x42a37d1 bthread_make_fcontext

Or:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1672390713 (unix time) try "date -d @1672390713" if you are using GNU date ***
PC: @     0x7fac0aea3387 __GI_raise
*** SIGABRT (@0x3f000004a75) received by PID 19061 (TID 0x7fab128b6700) from PID 19061; stack trace: ***
    @          0x4875742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fac0b958630 (unknown)
    @     0x7fac0aea3387 __GI_raise
    @     0x7fac0aea4a78 __GI_abort
    @          0x1c4f8af _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62f82f6 __cxxabiv1::__terminate()
    @          0x62f8361 std::terminate()
    @          0x62f84b4 __cxa_throw
    @          0x1c4f7b6 _Znwm.cold
    @          0x4868341 google::LogMessage::Init()
    @          0x4868a01 google::LogMessage::LogMessage()
    @          0x499ec02 brpc::InputMessenger::OnNewMessages()
    @          0x4a4595e brpc::Socket::ProcessEvent()
    @          0x4953a0f bthread::TaskGroup::task_runner()
    @          0x4adc0d1 bthread_make_fcontext

Wrong result for LIMIT after GROUP BY: the number of returned rows fluctuates

Primary Key table compaction crash

start time: Tue Dec 27 17:12:33 CST 2022
*** Aborted at 1672132354 (unix time) try "date -d @1672132354" if you are using GNU date ***
PC: @          0x17eb7f7 starrocks::vectorized::ChunkHelper::column_from_field()
*** SIGSEGV (@0x0) received by PID 34830 (TID 0x7f181273b700) from PID 0; stack trace: ***
    @          0x34fe482 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f18767605e0 (unknown)
    @          0x17eb7f7 starrocks::vectorized::ChunkHelper::column_from_field()
    @          0x17ebdda starrocks::vectorized::ChunkHelper::new_chunk()
    @          0x18ff0ae starrocks::vectorized::RowsetMergerImpl<>::_do_merge_horizontally()
    @          0x19021b2 starrocks::vectorized::RowsetMergerImpl<>::do_merge()
    @          0x18ef267 starrocks::vectorized::compaction_merge_rowsets()
    @          0x17d88e8 starrocks::TabletUpdates::_do_compaction()
    @          0x17d9999 starrocks::TabletUpdates::compaction()
    @          0x176314c starrocks::StorageEngine::_perform_update_compaction()
    @          0x1757e9f starrocks::StorageEngine::_update_compaction_thread_callback()
    @          0x4fdb870 execute_native_thread_routine
    @     0x7f1876758e25 start_thread
    @     0x7f1875b6234d __clone
    @                0x0 (unknown)

Version already been compacted

 version already been compacted

get_json_int or get_json_double crash

terminate called after throwing an instance of 'arangodb::velocypack::Exception'
  what():  Expecting numeric type
query_id:3c5935cf-8299-11ed-8c5b-06d48912a230, fragment_instance:3c5935cf-8299-11ed-8c5b-06d48912a231
*** Aborted at 1671783017 (unix time) try "date -d @1671783017" if you are using GNU date ***
PC: @     0x7fddd4449ca0 __GI_raise
*** SIGABRT (@0xbb2) received by PID 2994 (TID 0x7fdd59562700) from PID 2994; stack trace: ***
    @          0x481e332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fddd4f208e0 (unknown)
    @     0x7fddd4449ca0 __GI_raise
    @     0x7fddd444b148 __GI_abort
    @          0x1c4acef _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a0af6 __cxxabiv1::__terminate()
    @          0x62a0b61 std::terminate()
    @          0x62a0cb4 __cxa_throw
    @          0x3a146e2 starrocks::vectorized::JsonFunctions::_json_query_impl<>()
    @          0x3a0f152 starrocks::vectorized::JsonFunctions::get_native_json_double()
    @          0x39af998 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x3454e2c starrocks::ExprContext::evaluate()
    @          0x2e26eb2 starrocks::pipeline::ProjectOperator::push_chunk()
    @          0x2e7868c starrocks::pipeline::PipelineDriver::process()
    @          0x2e6e5a3 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2680a05 starrocks::ThreadPool::dispatch_thread()
    @          0x267bf2a starrocks::Thread::supervise_thread()
    @     0x7fddd4f1644b start_thread
    @     0x7fddd450556f __GI___clone
    @                0x0 (unknown)

Crash when querying Hive through a Hive catalog

PC: 0x3510d8a    starrocks::connector::HiveDataSource::_init_scanner()
SIGSEGV (@0x10) received by PID 407690 (TID 0x7f0d01763700) from PID 16; stack trace:
     0x403cce2    google::(anonymous namespace)::FailureSignalHandler()
0x7f0d52195852    os::Linux::chained_handler()
0x7fed5219c676    JVM_handle_linux_signal
0x7f0d52192653    signalHandler()
0x7f39493f05d0    (unknown)
     0x3510d8a    starrocks::vectorized::HdfsScanner::_build_scanner_context()
     0x35117ef    starrocks::vectorized::HdfsScanner::open()
     0x34822eb    starrocks::connector::HiveDataSource::_init_scanner()
     0x3484a33    starrocks::connector::HiveDataSource::open()
     0x28b0fdc    starrocks::pipeline::ConnectorChunkSource::_read_chunk()
     0x28b10c3    starrocks::pipeline::ConnectorChunkSource::buffer_next_batch_chunks_blocking()
     0x28ac22c    _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
     0x2011c60    starrocks::PriorityThreadPool::work_thread()
     0x3f92fa7    thread_proxy
0x7f0d51653dd5    start_thread
0x7f0d50c6eead    __clone
           0x0    (unknown)

Not found dict for cid

The query fails with:

Not found dict for cid
  • GitHub Issue: none
  • GitHub Fix PR: https://github.com/StarRocks/starrocks/pull/13185
  • Jira:
  • Affected versions:
    • 2.2.0 ~ 2.2.9
    • 2.3.0 ~ 2.3.4
    • 2.4.0 ~ 2.4.1
  • Fixed versions:
    • 2.2.10+
    • 2.3.5+
    • 2.4.2+
  • Workaround:
    • set global cbo_enable_low_cardinality_optimize=false;
  • Root cause:
    • See the PR description.

Queries hang when resource groups are used

pstack shows the following stack:

pstack <starrocks_be PID>

#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000005a229bc in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x37cf64bf8) at /var/local/gcc/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::condition_variable::wait (this=this@entry=0x37cf64bf8, __lock=...) at ../../../.././libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
#5  0x00000000028f0305 in starrocks::pipeline::GlobalDriverExecutor::_worker_thread (this=0xa892ee0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_executor.cpp:86
#6  0x000000000217fef9 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:248
#7  starrocks::FunctionRunnable::run (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:44
#8  starrocks::ThreadPool::dispatch_thread (this=0x19a50000) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:513
#9  0x000000000217baaa in std::function<void ()>::operator()() const (this=0x17fa08d8) at /usr/include/c++/10.3.0/bits/std_function.h:248
#10 starrocks::Thread::supervise_thread (arg=0x17fa08c0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/thread.cpp:326
#11 0x00007fe5bc2c8ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fe5bb8e3b0d in clone () from /lib64/libc.so.6

Both take frames appear in the stack at the same time:

#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
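
When the pstack dump is large, it helps to count how many threads are parked in both take frames. The sketch below assumes the usual pstack layout, where each thread starts with a `Thread N (...)` header followed by `#k` frame lines. The function names and the abbreviated sample are hypothetical.

```python
def split_threads(pstack_output):
    """Group pstack output into one list of frame lines per thread."""
    threads, current = [], None
    for line in pstack_output.splitlines():
        if line.startswith("Thread "):
            current = []
            threads.append(current)
        elif line.startswith("#"):
            if current is None:  # frames pasted without a thread header
                current = []
                threads.append(current)
            current.append(line)
    return threads


def count_threads_in(pstack_output, symbols):
    """Count threads whose stack contains every symbol in `symbols`."""
    return sum(
        1
        for frames in split_threads(pstack_output)
        if all(any(sym in frame for frame in frames) for sym in symbols)
    )


# Abbreviated, made-up sample in the same shape as the stack above.
sample = """Thread 2 (Thread 0x7fe5b0000700 (LWP 101)):
#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400)
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>)
Thread 1 (Thread 0x7fe5b0001700 (LWP 100)):
#0  0x00007fe5bb8e3b0d in epoll_wait () from /lib64/libc.so.6
"""
TAKE_FRAMES = ["QuerySharedDriverQueue::take", "WorkGroupDriverQueue::take"]
print(count_threads_in(sample, TAKE_FRAMES))
```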

StatisticsCollectJob on table _statistics.column_statistics fails with Too many versions

2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateFragmentExecStatus():2174] one instance report fail errorCode SERVICE_UNAVAILABLE Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26, query_id=3772d178-8ca4-11ed-854d-6cfe54388271 instance_id=3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateStatus():1249] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: 3772d178-8ca4-11ed-854d-6cfe54388271, instance id: 3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1338] insert failed: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1415] handle insert stmt fail: insert_3772d178-8ca4-11ed-854d-6cfe54388271
com.starrocks.common.DdlException: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
        at com.starrocks.common.ErrorReport.reportDdlException(ErrorReport.java:80) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.handleDMLStmt(StmtExecutor.java:1339) [starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:471) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticsCollectJob.collectStatisticSync(StatisticsCollectJob.java:92) [starrocks-fe.jar:?]
        at com.starrocks.statistic.FullStatisticsCollectJob.collect(FullStatisticsCollectJob.java:62) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticExecutor.collectStatistics(StatisticExecutor.java:190) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticAutoCollector.runAfterCatalogReady(StatisticAutoCollector.java:61) [starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
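
To find which tablets hit the limit across a long fe.log, the error text can be matched with a regular expression. This is a sketch that assumes the message format shown in the log above; `find_too_many_versions` is a hypothetical helper name.

```python
import re

# Matches the message text as it appears in the FE log above.
TOO_MANY_VERSIONS = re.compile(
    r"Too many versions\. tablet_id: (\d+), version_count: (\d+), limit: (\d+)"
)


def find_too_many_versions(log_text):
    """Map each tablet_id to its last reported (version_count, limit)."""
    hits = {}
    for tablet_id, count, limit in TOO_MANY_VERSIONS.findall(log_text):
        hits[int(tablet_id)] = (int(count), int(limit))
    return hits


line = (
    "2023-01-05 10:54:05,174 WARN (AutoStatistic|38) "
    "[StmtExecutor.handleDMLStmt():1338] insert failed: Too many versions. "
    "tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26"
)
print(find_too_many_versions(line))
```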

INSERT memory leak (INSERT or INSERT INTO SELECT)

The FE Follower leaks memory while the Leader is normal. In the memory distribution, TxnStateCallbackFactory uses a large amount of memory.

jmap -histo pid


 num     #instances         #bytes  class name
----------------------------------------------
   1:      65039949     6979006048  [C
   2:       4022619     2525925768  [B
   3:      51632292     2478350016  java.util.HashMap
   4:      73877356     1773056544  java.lang.String
   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob
   6:      20355243      977051664  com.google.gson.internal.LinkedTreeMap$Node
   7:      20352822      976935456  com.google.gson.internal.LinkedTreeMap
   8:      10727986      935362088  [Ljava.util.HashMap$Node;
   9:      38312936      919510464  java.lang.Long
  10:      10172713      813817488  [Lorg.apache.commons.collections.map.AbstractHashedMap$HashEntry;
  11:      22247256      711912192  java.util.HashMap$Node
  12:      10230960      654781440  java.util.concurrent.ConcurrentHashMap
  13:      10461293      585832408  java.util.LinkedHashMap
  14:      10172712      569671872  com.starrocks.load.EtlStatus
  15:      10172712      569671872  org.apache.commons.collections.map.HashedMap
  16:      10172400      488275200  java.util.concurrent.locks.ReentrantReadWriteLock$FairSync

If com.starrocks.load.loadv2.InsertLoadJob accounts for a large share of memory, it is this issue.
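
The check can be scripted by parsing the histogram rows. The sketch below assumes the `num: #instances #bytes class name` layout shown above. `parse_histo` and `class_share` are hypothetical names, and the share is computed only over the rows passed in, not over the whole heap.

```python
import re

# One histogram row: rank, instance count, byte count, class name.
HISTO_ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Map class name to (instances, bytes) for each jmap -histo row."""
    rows = {}
    for line in text.splitlines():
        match = HISTO_ROW.match(line)
        if match:
            instances, size, cls = match.groups()
            rows[cls] = (int(instances), int(size))
    return rows


def class_share(text, class_name):
    """Fraction of the listed bytes held by class_name (0.0 if absent)."""
    rows = parse_histo(text)
    total = sum(size for _, size in rows.values())
    if class_name not in rows or total == 0:
        return 0.0
    return rows[class_name][1] / total


# Two rows copied from the histogram above.
sample = """
 num     #instances         #bytes  class name
----------------------------------------------
   1:      65039949     6979006048  [C
   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob
"""
JOB = "com.starrocks.load.loadv2.InsertLoadJob"
print(parse_histo(sample)[JOB], round(class_share(sample, JOB), 3))
```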

Join Reorder crash

*** Aborted at 1669284722 (unix time) try "date -d @1669284722" if you are using GNU date ***
PC: @          0x319a237 starrocks::serde::ColumnArraySerde::deserialize()
*** SIGSEGV (@0x0) received by PID 24275 (TID 0x7f00a6887700) from PID 0; stack trace: ***
    @          0x3cf85d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f014b602600 (unknown)
    @          0x319a237 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x319c793 starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x1ee0a78 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x1ee1b6b starrocks::DataStreamRecvr::SenderQueue::add_chunks()
    @          0x1ee3a79 starrocks::DataStreamRecvr::add_chunks()
    @          0x1ec3e3d starrocks::DataStreamMgr::transmit_chunk()
    @          0x1f07abc starrocks::PInternalServiceImpl<>::transmit_chunk()
    @          0x3e2a32e brpc::policy::ProcessRpcRequest()
    @          0x3e20d97 brpc::ProcessInputMessage()
    @          0x3e21c43 brpc::InputMessenger::OnNewMessages()
    @          0x3ec890e brpc::Socket::ProcessEvent()
    @          0x3dd689f bthread::TaskGroup::task_runner()
    @          0x3f5f081 bthread_make_fcontext

Schema change does not support ARRAY on Primary Key / Unique Key / Aggregate tables

mysql> ALTER TABLE test_add_array_column
    -> ADD COLUMN arr2 ARRAY< varchar (65533)>;
ERROR 1064 (HY000): Unexpected exception: ARRAY<VARCHAR(65533)> must be used in DUP_KEYS

Crash caused by missing AVX2 support

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1673246592 (unix time) try "date -d @1673246592" if you are using GNU date ***
PC: @ 0x4ab69a4 bitset_container_from_array
*** SIGILL (@0x4ab69a4) received by PID 9331 (TID 0x7f72ef254700) from PID 78342564; stack trace: ***
@ 0x4659ee2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f7309f71630 (unknown)
@ 0x4ab69a4 bitset_container_from_array
@ 0x4a9cc9d roaring_bitmap_add_many
@ 0x2fdd52e starrocks::DelVector::_add_dels()
@ 0x2fddcdc starrocks::DelVector::add_dels_as_new_version()
@ 0x2e5544c starrocks::TabletUpdates::_apply_rowset_commit()
@ 0x2e57482 starrocks::TabletUpdates::do_apply()
@ 0x362cab5 starrocks::ThreadPool::dispatch_thread()
@ 0x3627efa starrocks::thread::supervise_thread()
@ 0x7f7309f69ea5 start_thread
@ 0x7f7309584b0d __clone
@ 0x0 (unknown)
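  • Root cause
    • SIGILL usually means that the machine hosting the BE does not support the AVX2 instruction set.
  • Fix
    • Move to a machine that supports AVX2. To check: cat /proc/cpuinfo |grep avx2
    • Or disable AVX2 support and build the BE manually.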
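
The same check can be scripted across hosts. The sketch below parses text in the /proc/cpuinfo format, assuming a Linux x86 host where features appear on `flags` lines. `cpu_flags` and `supports_avx2` are hypothetical helper names.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_text):
    """Return True if any processor entry lists the avx2 flag."""
    return "avx2" in cpu_flags(cpuinfo_text)


# Shortened, made-up flags lines for illustration.
with_avx2 = "processor\t: 0\nflags\t\t: fpu sse4_2 avx avx2 bmi2\n"
without_avx2 = "processor\t: 0\nflags\t\t: fpu sse4_2 avx\n"
print(supports_avx2(with_avx2), supports_avx2(without_avx2))
```

On a real host, the input would be the contents of /proc/cpuinfo read as text.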

It crashes at the slightest provocation; the stability is worrying!

Checksum mismatch error

Bad page: checksum mismatch (actual=243080401 vs expect=12)
  • Root cause
    • Usually a disk hardware problem. Check whether dmesg -T reports I/O errors (look for "I/O error"):
[Sat Jan 14 21:30:54 2023] nvme1n1: Write(0x1) @ LBA 174796784, 1016 blocks, Data Transfer Error (sct 0x0 / sc 0x4) DNR 
[Sat Jan 14 21:30:54 2023] blk_update_request: critical target error, dev nvme1n1, sector 174796784 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
[Sat Jan 14 21:30:54 2023] EXT4-fs warning (device dm-0): ext4_end_bio:325: I/O error 5 writing to inode 216269336 (offset 8388608 size 8388608 starting block 21849088)
[Sat Jan 14 21:30:54 2023] buffer_io_error: 502 callbacks suppressed
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849088
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849089
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849090
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849091
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849092
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849093
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849094
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849095
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849096
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849097
[Sat Jan 14 21:30:54 2023] JBD2: Detected IO errors while flushing file data on dm-0-8
  • Solution
    • Replace the disk.
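
To see which device is failing, the dmesg output can be summarized per device. This is a sketch under the assumption that the kernel messages look like the ones above. The patterns cover only those three message shapes, and `io_errors_by_device` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns for the three I/O error message shapes seen in the dmesg output above.
PATTERNS = [
    re.compile(r"Buffer I/O error on device ([\w-]+)"),
    re.compile(r"blk_update_request: .*?\bdev (\w+)"),
    re.compile(r"EXT4-fs warning \(device ([\w-]+)\): .*I/O error"),
]


def io_errors_by_device(dmesg_text):
    """Count I/O error lines per device name."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each line at most once
    return counts


# Lines copied from the dmesg output above.
sample = """[Sat Jan 14 21:30:54 2023] blk_update_request: critical target error, dev nvme1n1, sector 174796784 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
[Sat Jan 14 21:30:54 2023] EXT4-fs warning (device dm-0): ext4_end_bio:325: I/O error 5 writing to inode 216269336 (offset 8388608 size 8388608 starting block 21849088)
[Sat Jan 14 21:30:54 2023] buffer_io_error: 502 callbacks suppressed
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849088
"""
print(dict(io_errors_by_device(sample)))
```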