- INSERT OVERWRITE or a materialized view refresh causes a memory leak on FE followers
- GitHub Issue:
- GitHub Fix PR:
- Jira:
- Affected versions:
- 2.2.0 ~ 2.2.12
- 2.3.0 ~ 2.3.8
- 2.4.0 ~ 2.4.3
- 2.5.0 ~ 2.5.1
- Fixed versions:
- 2.2.13+
- 2.3.9+
- 2.4.4+
- 2.5.2+
- Temporary workaround:
- Restart the FE
- Root cause:
- See the issue description
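To decide quickly whether a cluster falls inside the affected ranges listed above, a version check along the following lines may help. This is a minimal sketch: the ranges are copied from the list, while the helper names `parse_version` and `is_affected` are hypothetical and it assumes plain `major.minor.patch` version strings.

```python
# Affected ranges copied from the list above: (first affected, last affected), inclusive.
AFFECTED_RANGES = [
    ((2, 2, 0), (2, 2, 12)),
    ((2, 3, 0), (2, 3, 8)),
    ((2, 4, 0), (2, 4, 3)),
    ((2, 5, 0), (2, 5, 1)),
]


def parse_version(text):
    """Turn '2.3.8' into (2, 3, 8). Suffixes such as '-rc1' are not handled."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text):
    """Return True if the version falls inside one of the affected ranges."""
    version = parse_version(version_text)
    return any(low <= version <= high for low, high in AFFECTED_RANGES)


if __name__ == "__main__":
    for v in ("2.3.8", "2.3.9", "2.5.2"):
        print(v, "affected" if is_affected(v) else "not affected")
```

Tuples compare element by element, so `(2, 3, 9)` sorts after `(2, 3, 8)` without any string handling.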
*** SIGSEGV (@0x0) received by PID 287072 (TID 0x7f99979d7640) from PID 0; stack trace: ***
@ 0x56ec9c2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f9b214fb1d0 (unknown)
@ 0x453d51d std::__push_heap<>()
@ 0x454a1d9 starrocks::vectorized::HeapMergeIterator::fill()
@ 0x454b046 starrocks::vectorized::HeapMergeIterator::do_get_next()
@ 0x454b704 starrocks::vectorized::HeapMergeIterator::do_get_next()
@ 0x43d40c3 starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x43cfbae starrocks::vectorized::AggregateIterator::do_get_next()
@ 0x43d1254 starrocks::vectorized::AggregateIterator::do_get_next()
@ 0x43d40c3 starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x40484ae starrocks::vectorized::TabletReader::do_get_next()
@ 0x4021292 starrocks::vectorized::ProjectionIterator::do_get_next()
@ 0x2ee880b starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
@ 0x2ee8eeb starrocks::pipeline::OlapChunkSource::_read_chunk()
@ 0x2ed888c starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
@ 0x2c5bce4 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
@ 0x2c6cd1d starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x47a3a9d starrocks::ThreadPool::dispatch_thread()
@ 0x479e82a starrocks::Thread::supervise_thread()
@ 0x7f9b214f03fb start_thread
@ 0x7f9b1fd27c23 __GI___clone
@ 0x0 (unknown)
I0127 00:46:02.271173 7878 snapshot_manager.cpp:112] make primary snapshot tablet:334728 cur_version:65905 missing_version_ranges:65662 timeout:180
I0127 00:46:02.271206 7878 tablet_updates.cpp:2954] get_rowsets_for_snapshot: too many rowsets for incremental clone #rowset:244 #rowset_for_full_clone:1 tablet:334728 #version:1 [65905.1 65905.1@0 65905.1] #pending:0
W0127 00:46:02.271214 7878 agent_server.cpp:308] fail to make_snapshot. tablet_id:334728 msg:Not found: get_rowsets_for_snapshot: too many rowsets for incremental clone #rowset:244 #rowset_for_full_clone:1 tablet:334728 #version:1 [65905.1 65905.1@0 65905.1] #pending:0
Dict Decode failed, Dict can't take cover all key :0
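Typically the process is still alive but its ports are unreachable. Running pstack <pid> shows relatively few threads, and one CPU core is fully saturated.
Typical pstack stack: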
#14 0x0000000001b2745b in starrocks::TabletUpdates::init() ()
#15 0x0000000001ad711f in starrocks::Tablet::_init_once_action() ()
#16 0x0000000001ad734d in void std::call_once<starrocks::Status starrocks::StarRocksCallOnce<starrocks::Status>::call<starrocks::Tablet::init()::{lambda()#1}>(starrocks::Tablet::init()::{lambda()#1})::{lambda()#1}>(std::once_flag&, starrocks::Tablet::init()::{lambda()#1}&&)::{lambda()#2}::_FUN() ()
#17 0x00007f8f0e6f3e40 in pthread_once () from /lib64/libpthread.so.0
#18 0x0000000001acad51 in starrocks::Tablet::init() ()
#19 0x0000000001ae5849 in starrocks::TabletManager::load_tablet_from_meta(starrocks::DataDir*, long, int, std::basic_string_view<char, std::char_traits<char> >, bool, bool, bool, bool) ()
#20 0x0000000001ac7314 in std::_Function_handler<bool (long, long, std::basic_string_view<char, std::char_traits<char> >), starrocks::DataDir::load()::{lambda(long, int, std::basic_string_view<char, std::char_traits<char> >)#2}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&, std::basic_string_view<char, std::char_traits<char> >&&) ()
#21 0x0000000001af7535 in std::_Function_handler<bool (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >), starrocks::TabletMetaManager::walk(starrocks::KVStore*, std::function<bool (long, long, std::basic_string_view<char, std::char_traits<char> >)> const&)::{lambda(std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)#1}>::_M_invoke(std::_Any_data const&, std::basic_string_view<char, std::char_traits<char> >&&, std::_Any_data const&) ()
#22 0x0000000001c95670 in starrocks::KVStore::iterate(starrocks::ColumnFamilyIndex, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)> const&) ()
#23 0x0000000001af9add in starrocks::TabletMetaManager::walk(starrocks::KVStore*, std::function<bool (long, long, std::basic_string_view<char, std::char_traits<char> >)> const&) ()
#24 0x0000000001ac8a01 in starrocks::DataDir::load() ()
#25 0x0000000001ab1b56 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<starrocks::StorageEngine::load_data_dirs(std::vector<starrocks::DataDir*, std::allocator<starrocks::DataDir*> > const&)::{lambda()#1}> >>::_M_run() ()
#26 0x0000000005b5f9f0 in execute_native_thread_routine ()
#27 0x00007f8f0e6eedd5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007f8f0dd09ead in clone () from /lib64/libc.so.6
PC: @ 0x4ccb226 starrocks::vectorized::Chunk::clone_empty_with_slot()
*** SIGSEGV (@0x0) received by PID 268757 (TID 0x7fc78adcc700) from PID 0; stack trace: ***
@ 0x5722822 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc895dd9630 (unknown)
@ 0x4ccb226 starrocks::vectorized::Chunk::clone_empty_with_slot()
@ 0x4ccb953 starrocks::vectorized::Chunk::clone_empty_with_slot()
@ 0x30ec210 starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
@ 0x30ecd67 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
@ 0x2c3fbe3 starrocks::pipeline::PipelineDriver::process()
@ 0x4dc1bd7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x47d2acd starrocks::ThreadPool::dispatch_thread()
@ 0x47cd85a starrocks::Thread::supervise_thread()
@ 0x7fc895dd1ea5 start_thread
@ 0x7fc8953ecb0d __clone
@ 0x0 (unknown)
*** Aborted at 1678163575 (unix time) try "date -d @1678163575" if you are using GNU date ***
PC: @ 0x7ff0c4cea00b gsignal
*** SIGABRT (@0x2b4) received by PID 692 (TID 0x7fef117b9700) from PID 692; stack trace: ***
@ 0x54d8a82 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7ff0c4ead420 (unknown)
@ 0x7ff0c4cea00b gsignal
@ 0x7ff0c4cc9859 abort
@ 0x29f47de _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
@ 0x7996116 __cxxabiv1::__terminate()
@ 0x7996181 std::terminate()
@ 0x79962d4 __cxa_throw
@ 0x29f46f6 _Znwm.cold
@ 0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7a1016b std::__cxx11::basic_string<>::_M_append()
@ 0x54d5dbf google::DumpStackTrace()
@ 0x45d1c2c __wrap___cxa_throw
@ 0x29f46f6 _Znwm.cold
@ 0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7a1016b std::__cxx11::basic_string<>::_M_append()
@ 0x54d5dbf google::DumpStackTrace()
@ 0x45d1c2c __wrap___cxa_throw
@ 0x29f46f6 _Znwm.cold
@ 0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7a1016b std::__cxx11::basic_string<>::_M_append()
@ 0x54d5dbf google::DumpStackTrace()
@ 0x45d1c2c __wrap___cxa_throw
@ 0x29f46f6 _Znwm.cold
@ 0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7a1016b std::__cxx11::basic_string<>::_M_append()
@ 0x54d5dbf google::DumpStackTrace()
@ 0x45d1c2c __wrap___cxa_throw
@ 0x29f46f6 _Znwm.cold
@ 0x7a0eb4a std::__cxx11::basic_string<>::_M_mutate()
@ 0x7a1016b std::__cxx11::basic_string<>::_M_append()
@ 0x54d5dbf google::DumpStackTrace()
This issue also exists in version 2.3.10; a backport PR has already been submitted.
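The crash headers in these traces carry the abort time as a Unix timestamp (for example `*** Aborted at 1678163575 (unix time)`), and the header itself suggests `date -d @...` for GNU date. Where GNU date is not available, a small script can do the same conversion. The following is a minimal sketch; the function name `crash_time_utc` is hypothetical and it assumes the header format shown in the traces here.

```python
import re
from datetime import datetime, timezone

# Matches the header line emitted before a crash stack trace, as seen above.
ABORTED_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")


def crash_time_utc(line):
    """Return the crash time as a timezone-aware UTC datetime, or None if the
    line is not an 'Aborted at' header."""
    match = ABORTED_RE.search(line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    header = '*** Aborted at 1678163575 (unix time) try "date -d @1678163575" if you are using GNU date ***'
    print(crash_time_utc(header).isoformat())
```

The result is in UTC; convert it to the BE host's local time zone before correlating it with be.INFO or be.WARNING log lines, which use local time.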
I0320 10:52:20.131407 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 2354181, txn_id: 22394960, tablet: 2354286.1722053141.5247b51b58eaadec-44cacc1b65f51b9d
I0320 10:52:20.131418 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1828787, txn_id: 22394966, tablet: 1829568.820590957.614c3ce0d15a9dd9-f9ef196bf03b2dab
I0320 10:52:20.131428 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1635218, txn_id: 22394960, tablet: 1636052.1722053141.cb423e755483acac-b08fbe036699c887
I0320 10:52:20.131438 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1808762, txn_id: 22394966, tablet: 1808815.820590957.c447965a8662e1af-cfde7a5cd56fca94
I0320 10:52:20.131448 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 5578541, txn_id: 22394960, tablet: 5579122.1722053141.8f49146ca3ac6f35-67c56dd44c2429bf
I0320 10:52:20.131459 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4405613, txn_id: 22394966, tablet: 4405866.820590957.554c400665d2eb88-10d285ea9abd389f
I0320 10:52:20.131469 38381 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4782620, txn_id: 22394960, tablet: 4783417.1722053141.314f677faf599b17-c29c8c1dfa293ab4
I0320 10:52:20.131480 38394 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 6039076, txn_id: 22394966, tablet: 6039273.820590957.7e41fcd0055968d5-693879b6b9cf63bd
I0320 10:52:20.131492 38387 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 4223033, txn_id: 22394960, tablet: 4223086.1722053141.724cb54a62fe404a-71ab614633f2878b
I0320 10:52:20.131505 38418 txn_manager.cpp:398] rollback transaction from engine successfully. partition_id: 1836797, txn_id: 22394966, tablet: 1837158.820590957.644cbc3da2faa67b-c63bf158d58e65a4
It can be triggered manually, but it is not triggered automatically.
*** Aborted at 1680758973 (unix time) try "date -d @1680758973" if you are using GNU date ***
PC: @ 0x4086310 google::protobuf::Message::SpaceUsedLong()
*** SIGSEGV (@0x0) received by PID 75495 (TID 0x7fabb41aa700) from PID 0; stack trace: ***
@ 0x3cb75d2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fabe07085e0 (unknown)
@ 0x4086310 google::protobuf::Message::SpaceUsedLong()
@ 0x1c26814 starrocks::ZoneMapIndexReader::mem_usage()
@ 0x1ba5e80 starrocks::ColumnReader::_load_zonemap_index()
@ 0x1ba5fad starrocks::ColumnReader::zone_map_filter()
@ 0x1bf4959 starrocks::ScalarColumnIterator::get_row_ranges_by_zone_map()
@ 0x1a33c72 starrocks::vectorized::SegmentIterator::_get_row_ranges_by_zone_map()
@ 0x1a3493f starrocks::vectorized::SegmentIterator::_init()
@ 0x1a35029 starrocks::vectorized::SegmentIterator::do_get_next()
@ 0x1a927f2 starrocks::vectorized::ProjectionIterator::do_get_next()
@ 0x1def65a starrocks::SegmentIteratorWrapper::do_get_next()
@ 0x1aca8ab starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x1ad0554 starrocks::vectorized::UnionIterator::do_get_next()
@ 0x1ac330e starrocks::vectorized::TabletReader::do_get_next()
@ 0x27880bd starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
@ 0x2788740 starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
@ 0x278b9e3 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
@ 0x1e16820 starrocks::PriorityThreadPool::work_thread()
@ 0x3c52c07 thread_proxy
@ 0x7fabe0700e25 start_thread
@ 0x7fabdfd2034d __clone
@ 0x0 (unknown)
A memory leak in LocalExchange causes memory usage to grow slowly.
*** Aborted at 1681439883 (unix time) try "date -d @1681439883" if you are using GNU date ***
PC: @ 0x7f87e0eb5720 __memcpy_ssse3_back
*** SIGSEGV (@0x7f83fcd49ffd) received by PID 1695 (TID 0x7f862f707700) from PID 18446744073656377341; stack trace: ***
@ 0x5769222 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f87e23749db os::Linux::chained_handler()
@ 0x7f87e23794bc JVM_handle_linux_signal
@ 0x7f87e236c378 signalHandler()
@ 0x7f87e184a630 (unknown)
@ 0x7f87e0eb5720 __memcpy_ssse3_back
@ 0x2c38092 starrocks::vectorized::BinaryColumnBase<>::append_selective()
@ 0x4d29e93 starrocks::vectorized::NullableColumn::append_selective()
@ 0x4d0d42a starrocks::vectorized::Chunk::append_selective()
@ 0x310b6ee starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
@ 0x310bfc7 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
@ 0x2c57583 starrocks::pipeline::PipelineDriver::process()
@ 0x4e075e7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4812f8d starrocks::ThreadPool::dispatch_thread()
@ 0x480dd1a starrocks::Thread::supervise_thread()
@ 0x7f87e1842ea5 start_thread
@ 0x7f87e0e5db0d __clone
@ 0x0 (unknown)
(1064, 'There are multi count(distinct) function call, multi distinct rewrite error')
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1684761824 (unix time) try "date -d @1684761824" if you are using GNU date ***
PC: @ 0x315124a starrocks::PersistentIndex::_merge_compaction()
*** SIGFPE (@0x315124a) received by PID 19356 (TID 0x7f792b578700) from PID 51712586; stack trace: ***
@ 0x4877742 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f794dda7630 (unknown)
@ 0x315124a starrocks::PersistentIndex::_merge_compaction()
@ 0x315529e starrocks::PersistentIndex::commit()
@ 0x2ed2c8e starrocks::PrimaryIndex::commit()
@ 0x2fa6786 starrocks::TabletUpdates::_apply_rowset_commit()
@ 0x2fa9023 starrocks::TabletUpdates::do_apply()
@ 0x37a9945 starrocks::ThreadPool::dispatch_thread()
@ 0x37a4d7a starrocks::Thread::supervise_thread()
@ 0x7f794dd9fea5 start_thread
@ 0x7f794d3ba8dd __clone
@ 0x0 (unknown)
This issue can also cause incorrect BitmapIndex query results. It is most likely to be triggered when a query hits multiple BitmapIndex indexes.
*** Aborted at 1666056468 (unix time) try "date -d @1666056468" if you are using GNU date ***
PC: @ 0x416239c run_container_andnot
*** SIGSEGV (@0x0) received by PID 38015 (TID 0x7f6c3cb49700) from PID 0; stack trace: ***
@ 0x3cf85d2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f6c99db5630 (unknown)
@ 0x416239c run_container_andnot
@ 0x4160ab9 run_run_container_andnot
@ 0x4160aef run_run_container_iandnot
@ 0x4146ad5 roaring_bitmap_andnot_inplace
@ 0x1a6b0de starrocks::vectorized::SegmentIterator::_apply_bitmap_index()
@ 0x1a6fe4a starrocks::vectorized::SegmentIterator::_init()
@ 0x1a70539 starrocks::vectorized::SegmentIterator::do_get_next()
@ 0x1acd5b2 starrocks::vectorized::ProjectionIterator::do_get_next()
@ 0x1e2dc0a starrocks::SegmentIteratorWrapper::do_get_next()
@ 0x1b0566b starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x1afe0ce starrocks::vectorized::TabletReader::do_get_next()
@ 0x27c9c4d starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
@ 0x27ca2d0 starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
@ 0x27cd573 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE0_E9_M_invokeERKSt9_Any_data
@ 0x1e54dd0 starrocks::PriorityThreadPool::work_thread()
@ 0x3c93c07 thread_proxy
@ 0x7f6c99dadea5 start_thread
@ 0x7f6c993c8b0d __clone
@ 0x0 (unknown)
#0 0x00000000025818f6 in starrocks::vectorized::Chunk::clone_empty_with_slot (this=0x15ebf4b70, size=212) at /root/starrocks/be/src/column/chunk.cpp:188
#1 0x0000000002581dc3 in starrocks::vectorized::Chunk::clone_empty_with_slot (this=0x15ebf4b70) at /root/starrocks/be/src/column/chunk.cpp:181
#2 0x0000000002a71f10 in starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk (this=0xa7267210, state=0x3e6b76000) at /root/starrocks/be/src/exec/pipeline/exchange/local_exchange_source_operator.cpp:112
#3 0x0000000002a72a67 in starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk (this=0xa7267210, state=0x3e6b76000) at /root/starrocks/be/src/exec/pipeline/exchange/local_exchange_source_operator.cpp:75
#4 0x000000000297df33 in starrocks::pipeline::PipelineDriver::process (this=this@entry=0xaf736910, runtime_state=runtime_state@entry=0x3e6b76000, worker_id=worker_id@entry=23) at /root/starrocks/be/src/exec/pipeline/pipeline_driver.cpp:164
#5 0x000000000297462e in starrocks::pipeline::GlobalDriverExecutor::_worker_thread (this=0xa77d880) at /root/starrocks/be/src/exec/pipeline/pipeline_driver_executor.cpp:124
#6 0x00000000021da2c9 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:622
#7 starrocks::FunctionRunnable::run (this=<optimized out>) at /root/starrocks/be/src/util/threadpool.cpp:44
#8 starrocks::ThreadPool::dispatch_thread (this=0xbda1500) at /root/starrocks/be/src/util/threadpool.cpp:513
#9 0x00000000021d5e7a in std::function<void ()>::operator()() const (this=0x298e4c58) at /usr/include/c++/10.3.0/bits/std_function.h:622
#10 starrocks::Thread::supervise_thread (arg=0x298e4c40) at /root/starrocks/be/src/util/thread.cpp:326
#11 0x00007fa26ee31ea5 in ?? ()
#12 0x0000000000000000 in ?? ()
The query may also return incorrect results instead of crashing.
*** Aborted at 1685449309 (unix time) try "date -d @1685449309" if you are using GNU date ***
PC: @ 0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0x10) received by PID 14038 (TID 0x7f3361305700) from PID 16; stack trace: ***
@ 0x6240182 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f33df4ed630 (unknown)
@ 0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
@ 0x34eeb6e starrocks::Aggregator::compute_batch_agg_states_with_selection()
@ 0x311b8ab starrocks::AggregateBlockingNode::open()
@ 0x4f4fe14 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
@ 0x4f5221d starrocks::PlanFragmentExecutor::open()
@ 0x4e9cb2b starrocks::FragmentExecState::execute()
@ 0x4ea3303 starrocks::FragmentMgr::exec_actual()
@ 0x506b022 starrocks::ThreadPool::dispatch_thread()
@ 0x5065b1a starrocks::Thread::supervise_thread()
@ 0x7f33df4e5ea5 start_thread
@ 0x7f33deb00b0d __clone
@ 0x0 (unknown)
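When matching a new BE crash against the traces collected on this page, it can help to reduce a trace to its signal header and its list of symbols, since addresses differ between builds while function names usually do not. The sketch below is an illustration only: `parse_signal_header` and `frame_symbols` are hypothetical helper names, and the regular expressions assume the header and frame formats shown in the traces on this page.

```python
import re

# Header such as:
# *** SIGSEGV (@0x10) received by PID 14038 (TID 0x7f3361305700) from PID 16; stack trace: ***
SIGNAL_RE = re.compile(
    r"\*\*\* (SIG[A-Z]+) \(@(0x[0-9a-fA-F]+)\) received by PID (\d+) "
    r"\(TID (0x[0-9a-fA-F]+)\)"
)
# Frame such as: @ 0x34eeb6e starrocks::Aggregator::compute_batch_agg_states_with_selection()
FRAME_RE = re.compile(r"@\s+0x[0-9a-fA-F]+\s+(\S.*)$")


def parse_signal_header(line):
    """Return a dict with signal, fault address, pid and tid, or None."""
    match = SIGNAL_RE.search(line)
    if match is None:
        return None
    signal, address, pid, tid = match.groups()
    return {"signal": signal, "address": address, "pid": int(pid), "tid": tid}


def frame_symbols(trace_text):
    """Return the symbol of every stack frame, top frame first.

    The 'PC:' line is skipped because it repeats the top frame.
    """
    symbols = []
    for line in trace_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("PC:"):
            continue
        match = FRAME_RE.search(stripped)
        if match:
            symbols.append(match.group(1).strip())
    return symbols
```

Comparing the first few entries of `frame_symbols(...)` for two traces is usually enough to tell whether they point at the same known issue.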