Common Crash / Bug / Optimization Lookup

This issue also exists in version 2.3.10; a backport PR has already been submitted.

  1. Large volume of Rollback log lines when a load fails

I0320 10:52:20.131407 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 2354181, txn_id: 22394960, tablet: 2354286.1722053141.5247b51b58eaadec-44cacc1b65f51b9d
I0320 10:52:20.131418 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1828787, txn_id: 22394966, tablet: 1829568.820590957.614c3ce0d15a9dd9-f9ef196bf03b2dab
I0320 10:52:20.131428 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1635218, txn_id: 22394960, tablet: 1636052.1722053141.cb423e755483acac-b08fbe036699c887
I0320 10:52:20.131438 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1808762, txn_id: 22394966, tablet: 1808815.820590957.c447965a8662e1af-cfde7a5cd56fca94
I0320 10:52:20.131448 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 5578541, txn_id: 22394960, tablet: 5579122.1722053141.8f49146ca3ac6f35-67c56dd44c2429bf
I0320 10:52:20.131459 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4405613, txn_id: 22394966, tablet: 4405866.820590957.554c400665d2eb88-10d285ea9abd389f
I0320 10:52:20.131469 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4782620, txn_id: 22394960, tablet: 4783417.1722053141.314f677faf599b17-c29c8c1dfa293ab4
I0320 10:52:20.131480 38394 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 6039076, txn_id: 22394966, tablet: 6039273.820590957.7e41fcd0055968d5-693879b6b9cf63bd
I0320 10:52:20.131492 38387 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4223033, txn_id: 22394960, tablet: 4223086.1722053141.724cb54a62fe404a-71ab614633f2878b
I0320 10:52:20.131505 38418 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1836797, txn_id: 22394966, tablet: 1837158.820590957.644cbc3da2faa67b-c63bf158d58e65a4
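When the log is flooded like this, it can help to group the rollback lines by txn_id to see which failed loads produce them. The following is a minimal sketch, not part of StarRocks: the regular expression is inferred only from the sample lines above, and the helper name `count_rollbacks_by_txn` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines above (an assumption);
# other BE versions may format this message differently.
ROLLBACK_RE = re.compile(
    r"rollback transaction from engine successfully\. "
    r"partition_id: (\d+), txn_id: (\d+), tablet: (\S+)"
)


def count_rollbacks_by_txn(lines):
    """Return a Counter mapping txn_id to the number of rollback log lines."""
    counts = Counter()
    for line in lines:
        match = ROLLBACK_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Three of the lines shown above, used as sample input.
sample = [
    "I0320 10:52:20.131407 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 2354181, txn_id: 22394960, tablet: 2354286.1722053141.5247b51b58eaadec-44cacc1b65f51b9d",
    "I0320 10:52:20.131418 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1828787, txn_id: 22394966, tablet: 1829568.820590957.614c3ce0d15a9dd9-f9ef196bf03b2dab",
    "I0320 10:52:20.131428 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1635218, txn_id: 22394960, tablet: 1636052.1722053141.cb423e755483acac-b08fbe036699c887",
]

for txn_id, n in count_rollbacks_by_txn(sample).most_common():
    print(txn_id, n)
```

In practice the `lines` argument would be an open handle on the BE log file rather than a hard-coded list.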
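  1. Primary Key table no longer triggers Compaction after SchemaChange, causing Too many versions

Compaction can still be triggered manually, but it is no longer triggered automatically.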

  1. Load zone map index crash

*** Aborted at 1680758973 (unix time) try "date -d @1680758973" if you are using GNU date ***
PC: @          0x4086310 google::protobuf::Message::SpaceUsedLong()
*** SIGSEGV (@0x0) received by PID 75495 (TID 0x7fabb41aa700) from PID 0; stack trace: ***
    @          0x3cb75d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fabe07085e0 (unknown)
    @          0x4086310 google::protobuf::Message::SpaceUsedLong()
    @          0x1c26814 starrocks::ZoneMapIndexReader::mem_usage()
    @          0x1ba5e80 starrocks::ColumnReader::_load_zonemap_index()
    @          0x1ba5fad starrocks::ColumnReader::zone_map_filter()
    @          0x1bf4959 starrocks::ScalarColumnIterator::get_row_ranges_by_zone_map()
    @          0x1a33c72 starrocks::vectorized::SegmentIterator::_get_row_ranges_by_zone_map()
    @          0x1a3493f starrocks::vectorized::SegmentIterator::_init()
    @          0x1a35029 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x1a927f2 starrocks::vectorized::ProjectionIterator::do_get_next()
    @          0x1def65a starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x1aca8ab starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1ad0554 starrocks::vectorized::UnionIterator::do_get_next()
    @          0x1ac330e starrocks::vectorized::TabletReader::do_get_next()
    @          0x27880bd starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x2788740 starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
    @          0x278b9e3 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
    @          0x1e16820 starrocks::PriorityThreadPool::work_thread()
    @          0x3c52c07 thread_proxy
    @     0x7fabe0700e25 start_thread
    @     0x7fabdfd2034d __clone
    @                0x0 (unknown)
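Each trace in this post opens with an "Aborted at ... (unix time)" line. Converting that value to a readable time lets you line the crash up with other BE logs. A minimal sketch that does the same thing as the suggested `date -d @...` command; the helper name `abort_time_utc` is hypothetical:

```python
import re
from datetime import datetime, timezone


def abort_time_utc(line):
    """Extract the unix timestamp from an 'Aborted at' line and return it as UTC.

    Returns None if the line does not contain such a timestamp.
    """
    match = re.search(r"Aborted at (\d+) \(unix time\)", line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


header = '*** Aborted at 1680758973 (unix time) try "date -d @1680758973" if you are using GNU date ***'
print(abort_time_utc(header).isoformat())
```

The result is in UTC. BE log timestamps are usually written in the server's local time zone, so convert before comparing.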
  1. BE memory leak (LocalExchange)

A memory leak in LocalExchange causes BE memory to grow slowly.

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ 2.2.13
    • 2.3.0 ~ 2.3.11
    • 2.4.0 ~ 2.4.4
    • 2.5.0 ~ 2.5.4
  • Fixed versions:
    • 2.2.14+
    • 2.3.12+
    • 2.4.5+
    • 2.5.5+
  • Temporary workaround:
  • Root cause:
    • A destructor was not declared virtual
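To check whether a given BE version falls inside one of the ranges listed above, a small comparison helper is enough. This is a sketch under two assumptions: version strings are plain "major.minor.patch" with no suffix, and the ranges are inclusive at both ends. The names `AFFECTED`, `parse_version` and `is_affected` are hypothetical.

```python
# Ranges copied from the version list above, treated as inclusive on both ends.
AFFECTED = [
    ((2, 2, 0), (2, 2, 13)),
    ((2, 3, 0), (2, 3, 11)),
    ((2, 4, 0), (2, 4, 4)),
    ((2, 5, 0), (2, 5, 4)),
]


def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints so tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version, ranges=AFFECTED):
    """Return True if the version lies inside any of the given ranges."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in ranges)


print(is_affected("2.3.10"), is_affected("2.3.12"))
```

The same helper can be reused for the other entries in this post by passing a different `ranges` list.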
  1. JDBC external table query crash

*** Aborted at 1681439883 (unix time) try "date -d @1681439883" if you are using GNU date ***
PC: @     0x7f87e0eb5720 __memcpy_ssse3_back
*** SIGSEGV (@0x7f83fcd49ffd) received by PID 1695 (TID 0x7f862f707700) from PID 18446744073656377341; stack trace: ***
    @          0x5769222 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f87e23749db os::Linux::chained_handler()
    @     0x7f87e23794bc JVM_handle_linux_signal
    @     0x7f87e236c378 signalHandler()
    @     0x7f87e184a630 (unknown)
    @     0x7f87e0eb5720 __memcpy_ssse3_back
    @          0x2c38092 starrocks::vectorized::BinaryColumnBase<>::append_selective()
    @          0x4d29e93 starrocks::vectorized::NullableColumn::append_selective()
    @          0x4d0d42a starrocks::vectorized::Chunk::append_selective()
    @          0x310b6ee starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
    @          0x310bfc7 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
    @          0x2c57583 starrocks::pipeline::PipelineDriver::process()
    @          0x4e075e7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x4812f8d starrocks::ThreadPool::dispatch_thread()
    @          0x480dd1a starrocks::Thread::supervise_thread()
    @     0x7f87e1842ea5 start_thread
    @     0x7f87e0e5db0d __clone
    @                0x0 (unknown)
  1. Multi distinct rewrite error

(1064, 'There are multi count(distinct) function call, multi distinct rewrite error')
  1. Persistent compaction crash

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1684761824 (unix time) try "date -d @1684761824" if you are using GNU date ***
PC: @          0x315124a starrocks::PersistentIndex::_merge_compaction()
*** SIGFPE (@0x315124a) received by PID 19356 (TID 0x7f792b578700) from PID 51712586; stack trace: ***
    @          0x4877742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f794dda7630 (unknown)
    @          0x315124a starrocks::PersistentIndex::_merge_compaction()
    @          0x315529e starrocks::PersistentIndex::commit()
    @          0x2ed2c8e starrocks::PrimaryIndex::commit()
    @          0x2fa6786 starrocks::TabletUpdates::_apply_rowset_commit()
    @          0x2fa9023 starrocks::TabletUpdates::do_apply()
    @          0x37a9945 starrocks::ThreadPool::dispatch_thread()
    @          0x37a4d7a starrocks::Thread::supervise_thread()
    @     0x7f794dd9fea5 start_thread
    @     0x7f794d3ba8dd __clone
    @                0x0 (unknown)
  1. Bitmap index apply crash

This issue can also cause BitmapIndex queries to return incorrect results. It is most likely to be triggered when a query hits multiple BitmapIndexes.

*** Aborted at 1666056468 (unix time) try "date -d @1666056468" if you are using GNU date ***
PC: @          0x416239c run_container_andnot
*** SIGSEGV (@0x0) received by PID 38015 (TID 0x7f6c3cb49700) from PID 0; stack trace: ***
    @          0x3cf85d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f6c99db5630 (unknown)
    @          0x416239c run_container_andnot
    @          0x4160ab9 run_run_container_andnot
    @          0x4160aef run_run_container_iandnot
    @          0x4146ad5 roaring_bitmap_andnot_inplace
    @          0x1a6b0de starrocks::vectorized::SegmentIterator::_apply_bitmap_index()
    @          0x1a6fe4a starrocks::vectorized::SegmentIterator::_init()
    @          0x1a70539 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x1acd5b2 starrocks::vectorized::ProjectionIterator::do_get_next()
    @          0x1e2dc0a starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x1b0566b starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1afe0ce starrocks::vectorized::TabletReader::do_get_next()
    @          0x27c9c4d starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x27ca2d0 starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
    @          0x27cd573 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
    @          0x1e54dd0 starrocks::PriorityThreadPool::work_thread()
    @          0x3c93c07 thread_proxy
    @     0x7f6c99dadea5 start_thread
    @     0x7f6c993c8b0d __clone
    @                0x0 (unknown)
  1. Local Shuffle Crash

#0  0x00000000025818f6 in starrocks::vectorized::Chunk::clone_empty_with_slot (this=0x15ebf4b70, size=212) at /root/starrocks/be/src/column/chunk.cpp:188
#1  0x0000000002581dc3 in starrocks::vectorized::Chunk::clone_empty_with_slot (this=0x15ebf4b70) at /root/starrocks/be/src/column/chunk.cpp:181
#2  0x0000000002a71f10 in starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk (this=0xa7267210, state=0x3e6b76000) at /root/starrocks/be/src/exec/pipeline/exchange/local_exchange_source_operator.cpp:112
#3  0x0000000002a72a67 in starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk (this=0xa7267210, state=0x3e6b76000) at /root/starrocks/be/src/exec/pipeline/exchange/local_exchange_source_operator.cpp:75
#4  0x000000000297df33 in starrocks::pipeline::PipelineDriver::process (this=this@entry=0xaf736910, runtime_state=runtime_state@entry=0x3e6b76000, worker_id=worker_id@entry=23) at /root/starrocks/be/src/exec/pipeline/pipeline_driver.cpp:164
#5  0x000000000297462e in starrocks::pipeline::GlobalDriverExecutor::_worker_thread (this=0xa77d880) at /root/starrocks/be/src/exec/pipeline/pipeline_driver_executor.cpp:124
#6  0x00000000021da2c9 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:622
#7  starrocks::FunctionRunnable::run (this=<optimized out>) at /root/starrocks/be/src/util/threadpool.cpp:44
#8  starrocks::ThreadPool::dispatch_thread (this=0xbda1500) at /root/starrocks/be/src/util/threadpool.cpp:513
#9  0x00000000021d5e7a in std::function<void ()>::operator()() const (this=0x298e4c58) at /usr/include/c++/10.3.0/bits/std_function.h:622
#10 starrocks::Thread::supervise_thread (arg=0x298e4c40) at /root/starrocks/be/src/util/thread.cpp:326
#11 0x00007fa26ee31ea5 in ?? ()
#12 0x0000000000000000 in ?? ()
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.3.0 ~ 2.3.12
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.5
    • 3.0.0
  • Fixed versions:
    • 2.3.13+
    • 2.4.6+
    • 2.5.6+
    • 3.0.1+
  • Temporary workaround:
  • Root cause:
    • See the Issue description
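  1. Primary Key table: after SchemaChange, queries that hit the prefix index return incorrect results
    Adding an index or changing a column type can both trigger this
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0
  • Fixed versions:
    • 2.4.6+
    • 2.5.7+
    • 3.0.1+
  • Temporary workaround:
  • Root cause:
    • After SchemaChange, no HeapMerge is performed; the files are simply concatenated, so the data in the Segment files is not correctly sorted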
  1. GroupBy tinyint Crash

It may also return incorrect results instead of crashing.

*** Aborted at 1685449309 (unix time) try "date -d @1685449309" if you are using GNU date ***
PC: @          0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0x10) received by PID 14038 (TID 0x7f3361305700) from PID 16; stack trace: ***
    @          0x6240182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f33df4ed630 (unknown)
    @          0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
    @          0x34eeb6e starrocks::Aggregator::compute_batch_agg_states_with_selection()
    @          0x311b8ab starrocks::AggregateBlockingNode::open()
    @          0x4f4fe14 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4f5221d starrocks::PlanFragmentExecutor::open()
    @          0x4e9cb2b starrocks::FragmentExecState::execute()
    @          0x4ea3303 starrocks::FragmentMgr::exec_actual()
    @          0x506b022 starrocks::ThreadPool::dispatch_thread()
    @          0x5065b1a starrocks::Thread::supervise_thread()
    @     0x7f33df4e5ea5 start_thread
    @     0x7f33deb00b0d __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.3.0 ~ 2.3.12
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0
  • Fixed versions:
    • 2.3.13+
    • 2.4.6+
    • 2.5.7+
    • 3.0.1+
  • Temporary workaround:
    • cast(tinyint as int)
  • Root cause:
    • Caused by the FixedSizedHashTable optimization
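  1. Select * and select count(*) return inconsistent results

The usual cause is that the data in the Segment files is sorted incorrectly, so queries that go through the prefix index return incorrect results.

After a SchemaChange modifies a Key column, query results become inconsistent.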

  1. Lambda function memory leak
  1. regexp_replace memory leak

  1. Expression child number xxxx exceeded the maximum 10000

Expression child number xxxx exceeded the maximum 10000
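  • Affected versions:
    • All versions
  • Solution:
    • Set expr_children_limit=50000 in fe.conf (the default is 10000; raising it risks performance degradation and excessive resource usage)
  • Root cause:
    • The number of child Exprs is currently capped to prevent performance problems and excessive resource usage.
    • If you need to write many rows with insert into values, use a load method such as stream load, routine load, or flink instead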
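To illustrate how a large insert into values statement can run into this limit, the sketch below splits rows into batches that stay under a given number of value literals. It rests on an assumption this post does not confirm: that every literal in the VALUES list counts as one child Expr. The helper name `split_rows` is hypothetical, and the load methods recommended above remain the better choice for bulk data.

```python
def split_rows(rows, expr_children_limit=10000):
    """Yield batches of rows whose total literal count stays within the limit.

    Assumption (not confirmed by the source): each value literal in a row
    counts as one child Expr against expr_children_limit.
    """
    batch, used = [], 0
    for row in rows:
        width = len(row)
        if width > expr_children_limit:
            raise ValueError("a single row already exceeds the limit")
        # Start a new batch when adding this row would cross the limit.
        if batch and used + width > expr_children_limit:
            yield batch
            batch, used = [], 0
        batch.append(row)
        used += width
    if batch:
        yield batch


# 10000 rows of 3 columns each is 30000 literals in a single statement.
rows = [(i, "x", 1.0) for i in range(10000)]
batches = list(split_rows(rows))
print(len(batches), [len(b) for b in batches])
```

Under that assumption, each batch would map to one insert statement that stays within the default limit without changing fe.conf.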
  1. High CPU usage in native_queued_spin_lock_slowpath

perf top shows native_queued_spin_lock_slowpath consuming a large share of CPU.

This is usually most severe on machines with many cores running high-concurrency workloads.

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ latest
  • Fixed versions:
    • 2.4+
  • Temporary workaround:
  • Root cause:
    • TCMalloc performs poorly on NUMA architectures; versions 2.4+ have switched to Jemalloc
  1. Join runtime filter merge crash

*** Aborted at 1686552759 (unix time) try "date -d @1686552759" if you are using GNU date ***
PC: @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
*** SIGSEGV (@0x207fa2c000) received by PID 40541 (TID 0x7f6fd50cb700) from PID 2141372416; stack trace: ***
    @          0x3f8c022 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f707104a630 (unknown)
    @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
    @          0x33eb37e starrocks::vectorized::RuntimeFilterHelper::fill_runtime_bloom_filter()
    @          0x2a3824a starrocks::pipeline::PartialRuntimeFilterMerger::merge_local_bloom_filters()
    @          0x2a349bf starrocks::pipeline::HashJoinBuildOperator::set_finishing()
    @          0x29df067 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x29dfc85 starrocks::pipeline::PipelineDriver::process()
    @          0x29d65be starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2235c39 starrocks::ThreadPool::dispatch_thread()
    @          0x22317ea starrocks::Thread::supervise_thread()
    @     0x7f7071042ea5 start_thread
    @     0x7f707065d9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ 2.2.14
    • 2.3.0 ~ 2.3.13
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0 ~ 3.0.2
  • Fixed versions:
    • 2.2.15+
    • 2.3.14+
    • 2.4.6+
    • 2.5.7+
    • 3.0.3+
  • Temporary workaround (affects performance):
    • set global enable_global_runtime_filter=false;
    • set global runtime_join_filter_push_down_limit=0;
  • Root cause:
    • The join ON column (on the right table) is a string column, and the total length of its values exceeds 4G
  1. Partial update causes BE to crash on startup

BE crashes repeatedly while loading Tablets during startup.

*** SIGSEGV (@0x8) received by PID 244327 (TID 0x7facab9fe700) from PID 8; stack trace: ***
@         0x481e332 google::(anonymous namespace)::FailureSignalHandler()
@         0x7facdbc62630 (unknown)
@         0x24d2f7b starrocks::Rowset::do_load()
@         0x24d35cf starrocks::Rowset::load()
@         0x24d3966 starrocks::Rowset::get_segment_iterators2()
@         0x20334ec starrocks::RowsetUpdateState::_do_load()
@         0x2034f78 _ZZSt9call_onceIZN9starrocks17RowsetUpdateState4loadEPNS0_6TabletEPNS0_6RowsetEEUlvE_JEEvRSt9once_flagOT_DpOT0_ENUlvE0_4_FUNEv
@         0x7facdbc5920b __pthread_once_slow
@         0x202fd63 starrocks::RowsetUpdateState::load()
@         0x1e6da98 starrocks::TabletUpdates::_apply_rowset_commit()
@         0x1e73bb3 starrocks::TabletUpdates::do_apply()
@         0x2680af5 starrocks::ThreadPool::dispatch_thread()
@         0x267bf2a starrocks::supervise_thread()
@         0x7facdbc5aea5 start_thread
@         0x7facdb27596d __clone
@         0x0 (unknown)
  1. Crash when using expression-based automatic partition creation

For example: PARTITION BY date_trunc('day', dt)
*** Aborted at 1686563917 (unix time) try "date -d @1686563917" if you are using GNU date ***
PC: @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
*** SIGSEGV (@0x4f95) received by PID 725038 (TID 0x7f70fd996700) from PID 20373; stack trace: ***
    @          0x62de642 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f715ea9b370 (unknown)
    @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
    @          0x56f695e starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5710634 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4fd3928 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4fd5a2d starrocks::PlanFragmentExecutor::open()
    @          0x4f1f9bb starrocks::FragmentExecState::execute()
    @          0x4f261e3 starrocks::FragmentMgr::exec_actual()
    @          0x50ed9b2 starrocks::ThreadPool::dispatch_thread()
    @          0x50e84aa starrocks::Thread::supervise_thread()
    @     0x7f715ea93dc5 start_thread
    @     0x7f715e0b476d __clone
    @                0x0 (unknown)