Common Crash / Bug / Optimization Lookup

  1. InfoSchemaDb id shouldn’t larger than 10000
2023-02-16 00:00:34,021 ERROR (leaderCheckpointer|75) [Checkpoint.runAfterCatalogReady():106] Exception when generate new image file
java.lang.IllegalStateException: InfoSchemaDb id shouldn’t larger than 10000, please restart your FE server
at com.google.common.base.Preconditions.checkState(Preconditions.java:510) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.server.LocalMetastore.loadCluster(LocalMetastore.java:3598) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.loadImage(GlobalStateMgr.java:1131) ~[starrocks-fe.jar:?]
at com.starrocks.master.Checkpoint.runAfterCatalogReady(Checkpoint.java:87) [starrocks-fe.jar:?]
at com.starrocks.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:61) [starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
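  1. Dynamic partitioning: a large number of historical partitions are created, and the operation times out
    When dynamic_partition.start is set to a small value (far in the past), a large number of historical partitions are created.
  1. Primary Key model: disk usage keeps growing after persistent index is enabled
  1. BE crashes while loading the persistent index at startup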

*** Aborted at 1676903379 (unix time) try "date -d @1676903379" if you are using GNU date ***
PC: @          0x310317a starrocks::PersistentIndex::_merge_compaction()
*** SIGFPE (@0x310317a) received by PID 30837 (TID 0x7f8d745fe700) from PID 51392890; stack trace: ***
    @          0x4825332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fe755c565e0 (unknown)
    @          0x310317a starrocks::PersistentIndex::_merge_compaction()
    @          0x3104bc0 starrocks::PersistentIndex::_check_and_flush_l0()
    @          0x31070c0 starrocks::PersistentIndex::commit()
    @          0x2e8936e starrocks::PrimaryIndex::commit()
    @          0x2f5cbf5 starrocks::TabletUpdates::_apply_compaction_commit()
    @          0x2f5e52d starrocks::TabletUpdates::do_apply()
    @          0x3762b55 starrocks::ThreadPool::dispatch_thread()
    @          0x375df8a starrocks::Thread::supervise_thread()
    @     0x7fe755c4ee25 start_thread
    @     0x7fe75526e34d __clone
    @                0x0 (unknown)
  1. INSERT OVERWRITE or materialized view refresh causes a memory leak on FE followers
  1. Crash when querying a single-table materialized view

*** SIGSEGV (@0x0) received by PID 287072 (TID 0x7f99979d7640) from PID 0; stack trace: ***
    @          0x56ec9c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f9b214fb1d0 (unknown)
    @          0x453d51d std::__push_heap<>()
    @          0x454a1d9 starrocks::vectorized::HeapMergeIterator::fill()
    @          0x454b046 starrocks::vectorized::HeapMergeIterator::do_get_next()
    @          0x454b704 starrocks::vectorized::HeapMergeIterator::do_get_next()
    @          0x43d40c3 starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x43cfbae starrocks::vectorized::AggregateIterator::do_get_next()
    @          0x43d1254 starrocks::vectorized::AggregateIterator::do_get_next()
    @          0x43d40c3 starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x40484ae starrocks::vectorized::TabletReader::do_get_next()
    @          0x4021292 starrocks::vectorized::ProjectionIterator::do_get_next()
    @          0x2ee880b starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x2ee8eeb starrocks::pipeline::OlapChunkSource::_read_chunk()
    @          0x2ed888c starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
    @          0x2c5bce4 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
    @          0x2c6cd1d starrocks::workgroup::ScanExecutor::worker_thread()
    @          0x47a3a9d starrocks::ThreadPool::dispatch_thread()
    @          0x479e82a starrocks::Thread::supervise_thread()
    @     0x7f9b214f03fb start_thread
    @     0x7f9b1fd27c23 __GI___clone
    @                0x0 (unknown)
  1. Primary Key model: falling back from incremental clone to full clone fails, so the clone keeps failing

I0127 00:46:02.271173  7878 snapshot_manager.cpp:112] make primary snapshot tablet:334728 cur_version:65905 missing_version_ranges:65662 timeout:180
I0127 00:46:02.271206  7878 tablet_updates.cpp:2954] get_rowsets_for_snapshot: too many rowsets for incremental clone #rowset:244 #rowset_for_full_clone:1 tablet:334728 #version:1 [65905.1 65905.1@0 65905.1] #pending:0
W0127 00:46:02.271214  7878 agent_server.cpp:308] fail to make_snapshot. tablet_id:334728 msg:Not found: get_rowsets_for_snapshot: too many rowsets for incremental clone #rowset:244 #rowset_for_full_clone:1 tablet:334728 #version:1 [65905.1 65905.1@0 65905.1] #pending:0
  • Github Issue:
  • Github Fix PR:
  • Jira:
  • Affected versions:
    • 2.3.0 ~ 2.3.7
    • 2.4.0 ~ 2.4.3
    • 2.5.0
  • Fixed versions:
    • 2.3.8+
    • 2.4.4+
    • 2.5.1+
  • Temporary workaround:
    • Manually delete the corresponding tablet on the BE
  • Root cause:
    • See the issue description
  1. Low-cardinality bug: Dict Decode failed

Dict Decode failed, Dict can't take cover all key :0
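  1. BE fails to restart after the Primary Key model is used

Typically the process is alive but its ports are unreachable. Running pstack <pid> shows relatively few threads, and one CPU core is fully saturated.

Typical pstack stack: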

#14 0x0000000001b2745b in starrocks::TabletUpdates::init() ()
#15 0x0000000001ad711f in starrocks::Tablet::_init_once_action() ()
#16 0x0000000001ad734d in void std::call_once<starrocks::Status starrocks::StarRocksCallOnce<starrocks::Status>::call<starrocks::Tablet::init()::{lambda()#1}>(starrocks::Tablet::init()::{lambda()#1})::{lambda()#1}>(std::once_flag&, starrocks::Tablet::init()::{lambda()#1}&&)::{lambda()#2}::_FUN() ()
#17 0x00007f8f0e6f3e40 in pthread_once () from /lib64/libpthread.so.0
#18 0x0000000001acad51 in starrocks::Tablet::init() ()
#19 0x0000000001ae5849 in starrocks::TabletManager::load_tablet_from_meta(starrocks::DataDir*, long, int, std::basic_string_view<char, std::char_traits<char> >, bool, bool, bool, bool) ()
#20 0x0000000001ac7314 in std::_Function_handler<bool (long, long, std::basic_string_view<char, std::char_traits<char> >), starrocks::DataDir::load()::{lambda(long, int, std::basic_string_view<char, std::char_traits<char> >)#2}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&, std::basic_string_view<char, std::char_traits<char> >&&) ()
#21 0x0000000001af7535 in std::_Function_handler<bool (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >), starrocks::TabletMetaManager::walk(starrocks::KVStore*, std::function<bool (long, long, std::basic_string_view<char, std::char_traits<char> >)> const&)::{lambda(std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)#1}>::_M_invoke(std::_Any_data const&, std::basic_string_view<char, std::char_traits<char> >&&, std::_Any_data const&) ()
#22 0x0000000001c95670 in starrocks::KVStore::iterate(starrocks::ColumnFamilyIndex, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)> const&) ()
#23 0x0000000001af9add in starrocks::TabletMetaManager::walk(starrocks::KVStore*, std::function<bool (long, long, std::basic_string_view<char, std::char_traits<char> >)> const&) ()
#24 0x0000000001ac8a01 in starrocks::DataDir::load() ()
#25 0x0000000001ab1b56 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<starrocks::StorageEngine::load_data_dirs(std::vector<starrocks::DataDir*, std::allocator<starrocks::DataDir*> > const&)::{lambda()#1}> > >::_M_run() ()
#26 0x0000000005b5f9f0 in execute_native_thread_routine ()
#27 0x00007f8f0e6eedd5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007f8f0dd09ead in clone () from /lib64/libc.so.6
  1. Extra newline characters appear when loading CSV into StarRocks via Flink

  1. local_shuffle crash

PC: @   0x4ccb226 starrocks::vectorized::Chunk::clone_empty_with_slot()
*** SIGSEGV (@0x0) received by PID 268757 (TID 0x7fc78adcc700) from PID 0; stack trace: ***
    @   0x5722822 google::(anonymous namespace)::FailureSignalHandler()
    @   0x7fc895dd9630 (unknown)
    @   0x4ccb226 starrocks::vectorized::Chunk::clone_empty_with_slot()
    @   0x4ccb953 starrocks::vectorized::Chunk::clone_empty_with_slot()
    @   0x30ec210 starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
    @   0x30ecd67 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
    @   0x2c3fbe3 starrocks::pipeline::PipelineDriver::process()
    @   0x4dc1bd7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @   0x47d2acd starrocks::ThreadPool::dispatch_thread()
    @   0x47cd85a starrocks::Thread::supervise_thread()
    @   0x7fc895dd1ea5 start_thread
    @   0x7fc8953ecb0d __clone
    @   0x0 (unknown)
  1. DumpStackTrace Crash

*** Aborted at 1678163575 (unix time) try "date -d @1678163575" if you are using GNU date ***
PC: @     0x7ff0c4cea00b gsignal
*** SIGABRT (@0x2b4) received by PID 692 (TID 0x7fef117b9700) from PID 692; stack trace: ***
    @          0x54d8a82 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7ff0c4ead420 (unknown)
    @     0x7ff0c4cea00b gsignal
    @     0x7ff0c4cc9859 abort
    @          0x29f47de _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x7996116 __cxxabiv1::__terminate()
    @          0x7996181 std::terminate()
    @          0x79962d4 __cxa_throw
    @          0x29f46f6 _Znwm.cold
    @          0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
    @          0x7a1016b std::__cxx11::basic_string<>::_M_append()
    @          0x54d5dbf google::DumpStackTrace()
    @          0x45d1c2c __wrap___cxa_throw
    @          0x29f46f6 _Znwm.cold
    @          0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
    @          0x7a1016b std::__cxx11::basic_string<>::_M_append()
    @          0x54d5dbf google::DumpStackTrace()
    @          0x45d1c2c __wrap___cxa_throw
    @          0x29f46f6 _Znwm.cold
    @          0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
    @          0x7a1016b std::__cxx11::basic_string<>::_M_append()
    @          0x54d5dbf google::DumpStackTrace()
    @          0x45d1c2c __wrap___cxa_throw
    @          0x29f46f6 _Znwm.cold
    @          0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
    @          0x7a1016b std::__cxx11::basic_string<>::_M_append()
    @          0x54d5dbf google::DumpStackTrace()
    @          0x45d1c2c __wrap___cxa_throw
    @          0x29f46f6 _Znwm.cold
    @          0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
    @          0x7a1016b std::__cxx11::basic_string<>::_M_append()
    @          0x54d5dbf google::DumpStackTrace()

This issue also exists in version 2.3.10; a backport PR has already been submitted.

  1. A flood of rollback log entries is printed when a load fails

I0320 10:52:20.131407 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 2354181, txn_id: 22394960, tablet: 2354286.1722053141.5247b51b58eaadec-44cacc1b65f51b9d
I0320 10:52:20.131418 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1828787, txn_id: 22394966, tablet: 1829568.820590957.614c3ce0d15a9dd9-f9ef196bf03b2dab
I0320 10:52:20.131428 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1635218, txn_id: 22394960, tablet: 1636052.1722053141.cb423e755483acac-b08fbe036699c887
I0320 10:52:20.131438 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1808762, txn_id: 22394966, tablet: 1808815.820590957.c447965a8662e1af-cfde7a5cd56fca94
I0320 10:52:20.131448 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 5578541, txn_id: 22394960, tablet: 5579122.1722053141.8f49146ca3ac6f35-67c56dd44c2429bf
I0320 10:52:20.131459 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4405613, txn_id: 22394966, tablet: 4405866.820590957.554c400665d2eb88-10d285ea9abd389f
I0320 10:52:20.131469 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4782620, txn_id: 22394960, tablet: 4783417.1722053141.314f677faf599b17-c29c8c1dfa293ab4
I0320 10:52:20.131480 38394 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 6039076, txn_id: 22394966, tablet: 6039273.820590957.7e41fcd0055968d5-693879b6b9cf63bd
I0320 10:52:20.131492 38387 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4223033, txn_id: 22394960, tablet: 4223086.1722053141.724cb54a62fe404a-71ab614633f2878b
I0320 10:52:20.131505 38418 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1836797, txn_id: 22394966, tablet: 1837158.820590957.644cbc3da2faa67b-c63bf158d58e65a4
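  1. Primary Key model: compaction is no longer triggered after a schema change, causing Too many versions

Compaction can still be triggered manually, but it is never triggered automatically.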

  1. Load zone map index crash

*** Aborted at 1680758973 (unix time) try "date -d @1680758973" if you are using GNU date ***
PC: @          0x4086310 google::protobuf::Message::SpaceUsedLong()
*** SIGSEGV (@0x0) received by PID 75495 (TID 0x7fabb41aa700) from PID 0; stack trace: ***
    @          0x3cb75d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fabe07085e0 (unknown)
    @          0x4086310 google::protobuf::Message::SpaceUsedLong()
    @          0x1c26814 starrocks::ZoneMapIndexReader::mem_usage()
    @          0x1ba5e80 starrocks::ColumnReader::_load_zonemap_index()
    @          0x1ba5fad starrocks::ColumnReader::zone_map_filter()
    @          0x1bf4959 starrocks::ScalarColumnIterator::get_row_ranges_by_zone_map()
    @          0x1a33c72 starrocks::vectorized::SegmentIterator::_get_row_ranges_by_zone_map()
    @          0x1a3493f starrocks::vectorized::SegmentIterator::_init()
    @          0x1a35029 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x1a927f2 starrocks::vectorized::ProjectionIterator::do_get_next()
    @          0x1def65a starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x1aca8ab starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1ad0554 starrocks::vectorized::UnionIterator::do_get_next()
    @          0x1ac330e starrocks::vectorized::TabletReader::do_get_next()
    @          0x27880bd starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x2788740 starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
    @          0x278b9e3 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
    @          0x1e16820 starrocks::PriorityThreadPool::work_thread()
    @          0x3c52c07 thread_proxy
    @     0x7fabe0700e25 start_thread
    @     0x7fabdfd2034d __clone
    @                0x0 (unknown)
  1. BE memory leak (LocalExchange)

A memory leak in LocalExchange causes BE memory usage to grow slowly.

  • Github Issue:
  • Github Fix PR:
  • Jira:
  • Affected versions:
    • 2.2.0 ~ 2.2.13
    • 2.3.0 ~ 2.3.11
    • 2.4.0 ~ 2.4.4
    • 2.5.0 ~ 2.5.4
  • Fixed versions:
    • 2.2.14+
    • 2.3.12+
    • 2.4.5+
    • 2.5.5+
  • Temporary workaround:
  • Root cause:
    • The destructor was not declared virtual
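
The root cause named for this leak (a destructor that is not virtual) is a general C++ pitfall, and a small sketch may make it easier to recognize. The code below is not the StarRocks code: `Buffer`, `SourceBase`, `ShuffleSource`, and `live_buffers_after_base_delete` are hypothetical names made up for illustration. It shows the corrected shape, where the base class declares a virtual destructor so that destroying a derived object through a base pointer also releases the derived members.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical types for illustration only; these are not StarRocks classes.
static int g_live_buffers = 0;  // number of Buffer objects currently alive

struct Buffer {
    Buffer() { ++g_live_buffers; }
    ~Buffer() { --g_live_buffers; }
};

struct SourceBase {
    // Without `virtual` here, deleting a ShuffleSource through a SourceBase*
    // is undefined behavior; in practice ~ShuffleSource() is usually skipped
    // and its Buffer member is never released, which shows up as a slow leak.
    virtual ~SourceBase() = default;
};

struct ShuffleSource : SourceBase {
    Buffer buffer;  // owned resource, released by ~ShuffleSource()
};

// Compile-time guard against accidentally dropping `virtual`.
static_assert(std::has_virtual_destructor<SourceBase>::value,
              "SourceBase must have a virtual destructor");

// Destroys a derived object through a base-class pointer and returns how
// many buffers are still alive afterwards.
int live_buffers_after_base_delete() {
    std::unique_ptr<SourceBase> source(new ShuffleSource());
    source.reset();  // runs ~ShuffleSource() because the destructor is virtual
    return g_live_buffers;
}
```

Calling `live_buffers_after_base_delete()` returns 0 with the virtual destructor in place. The `static_assert` is one inexpensive way to keep a polymorphic base class from regressing.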
  1. Crash when querying a JDBC external table

*** Aborted at 1681439883 (unix time) try "date -d @1681439883" if you are using GNU date ***
PC: @     0x7f87e0eb5720 __memcpy_ssse3_back
*** SIGSEGV (@0x7f83fcd49ffd) received by PID 1695 (TID 0x7f862f707700) from PID 18446744073656377341; stack trace: ***
    @          0x5769222 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f87e23749db os::Linux::chained_handler()
    @     0x7f87e23794bc JVM_handle_linux_signal
    @     0x7f87e236c378 signalHandler()
    @     0x7f87e184a630 (unknown)
    @     0x7f87e0eb5720 __memcpy_ssse3_back
    @          0x2c38092 starrocks::vectorized::BinaryColumnBase<>::append_selective()
    @          0x4d29e93 starrocks::vectorized::NullableColumn::append_selective()
    @          0x4d0d42a starrocks::vectorized::Chunk::append_selective()
    @          0x310b6ee starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
    @          0x310bfc7 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
    @          0x2c57583 starrocks::pipeline::PipelineDriver::process()
    @          0x4e075e7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x4812f8d starrocks::ThreadPool::dispatch_thread()
    @          0x480dd1a starrocks::Thread::supervise_thread()
    @     0x7f87e1842ea5 start_thread
    @     0x7f87e0e5db0d __clone
    @                0x0 (unknown)
  1. Multi distinct rewrite error

(1064, 'There are multi count(distinct) function call, multi distinct rewrite error')
  1. Persistent compaction crash

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1684761824 (unix time) try "date -d @1684761824" if you are using GNU date ***
PC: @          0x315124a starrocks::PersistentIndex::_merge_compaction()
*** SIGFPE (@0x315124a) received by PID 19356 (TID 0x7f792b578700) from PID 51712586; stack trace: ***
    @          0x4877742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f794dda7630 (unknown)
    @          0x315124a starrocks::PersistentIndex::_merge_compaction()
    @          0x315529e starrocks::PersistentIndex::commit()
    @          0x2ed2c8e starrocks::PrimaryIndex::commit()
    @          0x2fa6786 starrocks::TabletUpdates::_apply_rowset_commit()
    @          0x2fa9023 starrocks::TabletUpdates::do_apply()
    @          0x37a9945 starrocks::ThreadPool::dispatch_thread()
    @          0x37a4d7a starrocks::Thread::supervise_thread()
    @     0x7f794dd9fea5 start_thread
    @     0x7f794d3ba8dd __clone
    @                0x0 (unknown)