Common Crashes / Bugs / Optimizations Lookup

  1. FE fails to start with a duplicate key error (caused by a materialized view)

com.google.gson.JsonSyntaxException: duplicate key: 7417179
        at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:190) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:145) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:963) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:928) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:877) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:848) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.ChangeMaterializedViewRefreshSchemeLog.read(ChangeMaterializedViewRefreshSchemeLog.java:71) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.JournalEntity.readFields(JournalEntity.java:358) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.bdbje.BDBJournalCursor.deserializeData(BDBJournalCursor.java:251) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.bdbje.BDBJournalCursor.next(BDBJournalCursor.java:295) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2137) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2097) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1142) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:324) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:721) ~[starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
  1. BE memory surges during a rolling (canary) upgrade

W0821 21:18:12.877478 64468 mem_hook.cpp:254] large memory alloc: 1988991648 bytes, stack:
    @          0x48a322b  malloc
    @          0x7df6b05  operator new()
    @          0x4822dd6  starrocks::QueryStatistics::merge_pb()
    @          0x4822f6f  starrocks::QueryStatisticsRecvr::insert()
    @          0x479ca34  starrocks::DataStreamMgr::transmit_chunk()
    @          0x52130d8  starrocks::PInternalServiceImplBase<>::transmit_chunk()
    @          0x5b8a2ad  brpc::policy::ProcessRpcRequest()
    @          0x5c6aa57  brpc::ProcessInputMessage()
    @          0x5ab4bef  bthread::TaskGroup::task_runner()
    @          0x5bf9151  bthread_make_fcontext
W0821 21:17:30.210508 62349 mem_hook.cpp:254] large memory alloc: 1091033952 bytes, stack:
    @          0x48a322b  malloc
    @          0x7df6b05  operator new()
    @          0x48229f6  starrocks::QueryStatistics::merge()
    @          0x4822ad2  starrocks::QueryStatisticsRecvr::aggregate()
    @          0x2d016b4  starrocks::pipeline::QueryContext::intermediate_query_statistic()
    @          0x47d124d  starrocks::RuntimeState::intermediate_query_statistic()
    @          0x4f9d87b  starrocks::pipeline::ExchangeSinkOperator::Channel::send_one_chunk()
    @          0x4f9e397  starrocks::pipeline::ExchangeSinkOperator::Channel::_close_internal()
    @          0x4f9e46c  starrocks::pipeline::ExchangeSinkOperator::Channel::close()
    @          0x4f9e849  starrocks::pipeline::ExchangeSinkOperator::set_finishing()
    @          0x2d1c9e9  starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x2d1da2f  starrocks::pipeline::PipelineDriver::process()
    @          0x4f91993  starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x4983a52  starrocks::ThreadPool::dispatch_thread()
    @          0x497e54a  starrocks::Thread::supervise_thread()
    @     0x7f36e1153ea5  start_thread
    @     0x7f36e076eb0d  __clone
    @              (nil)  (unknown)
  • GitHub Issue:
  • GitHub Fix PR:
  • Affected versions:
    • 2.3.0 ~ 2.3.16
    • 2.4.0 ~ 2.4.5
    • 2.5.10
    • 3.0.0 ~ 3.0.5
    • 3.1.0 ~ 3.1.1
  • Fixed versions:
    • 2.3.17+
    • 2.4.6+
    • 2.5.11+
    • 3.0.6+
    • 3.1.2+
  • Temporary workaround:
    • Memory usage returns to normal once all BEs have been upgraded.
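The version ranges listed above can be turned into a quick check. The following is a minimal sketch, not an official tool: the names `AFFECTED_RANGES`, `parse_version`, and `is_affected` are made up for illustration, and it assumes plain `major.minor.patch` version strings with no suffix.

```python
# Affected ranges copied from the list above (inclusive on both ends).
AFFECTED_RANGES = [
    ("2.3.0", "2.3.16"),
    ("2.4.0", "2.4.5"),
    ("2.5.10", "2.5.10"),
    ("3.0.0", "3.0.5"),
    ("3.1.0", "3.1.1"),
]


def parse_version(text):
    """Turn '2.5.10' into (2, 5, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version):
    """Return True if the given BE version falls inside any affected range."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi)
               for lo, hi in AFFECTED_RANGES)


if __name__ == "__main__":
    for candidate in ("2.5.10", "2.5.11", "3.0.5", "3.1.2"):
        status = "affected" if is_affected(candidate) else "not affected"
        print(candidate, status)
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "2.3.9" sorts after "2.3.16".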
  1. After using materialized views or INSERT OVERWRITE, disk usage of the trash directory grows too fast

  1. After a schema change, BE crashes when reading data because the columns do not match

terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_fill_insert
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1692869217 (unix time) try "date -d @1692869217" if you are using GNU date ***
PC: @     0x7f33942321d7 __GI_raise
*** SIGABRT (@0xd680e) received by PID 878606 (TID 0x7f32a4e08700) from PID 878606; stack trace: ***
    @          0x3f973c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f3394cdb370 (unknown)
    @     0x7f33942321d7 __GI_raise
    @     0x7f33942338c8 __GI_abort
    @          0x196a143 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x5baf946 __cxxabiv1::__terminate()
    @          0x5baf9b1 std::terminate()
    @          0x5bafb04 __cxa_throw
    @          0x196bcf3 std::__throw_length_error()
    @          0x1a9d3a0 std::vector<>::_M_fill_insert()
    @          0x25f83b8 starrocks::vectorized::NullableColumn::append_numbers()
    @          0x1d98ff1 starrocks::BitShufflePageDecoder<>::next_batch()
    @          0x1deb3d2 starrocks::ParsedPageV2::read()
    @          0x1dc146a starrocks::ScalarColumnIterator::next_batch()
    @          0x1c1437b starrocks::vectorized::SegmentIterator::_read()
    @          0x1c0cb9c starrocks::vectorized::SegmentIterator::_do_get_next()
    @          0x1c10611 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x20a44aa starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x1ca090b starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1c9959e starrocks::vectorized::TabletReader::do_get_next()
    @          0x2ae3c8d starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x2ae434a starrocks::pipeline::OlapChunkSource::buffer_next_batch_chunks_blocking()
    @          0x29ade88 _ZNSt17_Function_handlerIFvvEZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS1_12RuntimeStateEiEUlvE_E9_M_invokeERKSt9_Any_data
    @          0x2a5550e starrocks::workgroup::ScanExecutor::worker_thread()
    @          0x22395b9 starrocks::ThreadPool::dispatch_thread()
    @          0x223516a starrocks::Thread::supervise_thread()
    @     0x7f3394cd3dc5 start_thread
    @     0x7f33942f473d __clone
    @                0x0 (unknown)
  1. sub_bitmap function crash

query_id:45a32592-4589-11ee-b504-525400ff5865, fragment_instance:45a32592-4589-11ee-b504-525400ff5866
*** Aborted at 1693216637 (unix time) try "date -d @1693216637" if you are using GNU date ***
PC: @          0x46e600e starrocks::BitmapValue::sub_bitmap_internal()
*** SIGSEGV (@0x28) received by PID 67995 (TID 0x7feb72548700) from PID 40; stack trace: ***
    @          0x56ec9c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fec080b3630 (unknown)
    @          0x46e600e starrocks::BitmapValue::sub_bitmap_internal()
    @          0x4e0b857 starrocks::vectorized::BitmapFunctions::sub_bitmap()
    @          0x3cee06b starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x36d8f47 starrocks::ExprContext::evaluate()
    @          0x3ced97c starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x36d8fde starrocks::ExprContext::evaluate()
    @          0x2eced82 starrocks::pipeline::ProjectOperator::push_chunk()
    @          0x2c39826 starrocks::pipeline::PipelineDriver::process()
    @          0x4d8cff7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x47a3a9d starrocks::ThreadPool::dispatch_thread()
    @          0x479e82a starrocks::Thread::supervise_thread()
    @     0x7fec080abea5 start_thread
    @     0x7fec076c696d __clone
    @                0x0 (unknown)
  1. After backup/restore of a Primary Key table, BE crashes and cannot be restarted

PC: @     0x7fb573e62387 __GI_raise
*** SIGABRT (@0x4b200005492) received by PID 21650 (TID 0x7fb5301ff700) from PID 21650; stack trace: ***
    @          0x5aed0a2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fb574917630 (unknown)
    @     0x7fb573e62387 __GI_raise
    @     0x7fb573e63a78 __GI_abort
    @          0x2cce6fe starrocks::failure_function()
    @          0x5ae0a7d google::LogMessage::Fail()
    @          0x5ae2eef google::LogMessage::SendToLog()
    @          0x5ae05ce google::LogMessage::Flush()
    @          0x5ae34f9 google::LogMessageFatal::~LogMessageFatal()
    @          0x424efdc starrocks::TabletUpdates::_apply_rowset_commit()
    @          0x424f4c3 starrocks::TabletUpdates::do_apply()
    @          0x4af5cd5 starrocks::ThreadPool::dispatch_thread()
    @          0x4af06ba starrocks::Thread::supervise_thread()
    @     0x7fb57490fea5 start_thread
    @     0x7fb573f2ab0d __clone
    @                0x0 (unknown)
*** Aborted at 1692763575 (unix time) try "date -d @1692763575" if you are using GNU date ***
PC: @     0x7faa8bf8f387 __GI_raise
*** SIGABRT (@0x4b20000977a) received by PID 38778 (TID 0x7faa68c74700) from PID 38778; stack trace: ***
    @          0x5aed0a2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7faa8ca44630 (unknown)
    @     0x7faa8bf8f387 __GI_raise
    @     0x7faa8bf90a78 __GI_abort
    @          0x2cce6fe starrocks::failure_function()
    @          0x5ae0a7d google::LogMessage::Fail()
    @          0x5ae2eef google::LogMessage::SendToLog()
    @          0x5ae05ce google::LogMessage::Flush()
    @          0x5ae34f9 google::LogMessageFatal::~LogMessageFatal()
    @          0x41c8a12 starrocks::DataDir::load()
    @          0x41a9e3b _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN9starrocks13StorageEngine14load_data_dirsERKSt6vectorIPNS3_7DataDirESaIS7_EEEUlvE_EEEEE6_M_runEv
    @          0x7ffb6e0 execute_native_thread_routine
    @     0x7faa8ca3cea5 start_thread
    @     0x7faa8c057b0d __clone
    @                0x0 (unknown)

*** Aborted at 1694299670 (unix time) try "date -d @1694299670" if you are using GNU date ***
PC: @     0x7f0115956ca0 __GI_raise
*** SIGABRT (@0x14cb) received by PID 5323 (TID 0x7eff61773700) from PID 5323; stack trace: ***
    @          0x596c182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f011642d8e0 (unknown)
    @     0x7f0115956ca0 __GI_raise
    @     0x7f0115958148 __GI_abort
    @          0x2c85b3e starrocks::failure_function()
    @          0x595fb5d google::LogMessage::Fail()
    @          0x5961fcf google::LogMessage::SendToLog()
    @          0x595f6ae google::LogMessage::Flush()
    @          0x59625d9 google::LogMessageFatal::~LogMessageFatal()
    @          0x4587268 starrocks::BinaryDictPageDecoder<>::next_batch()
    @          0x45a4913 starrocks::ParsedPageV2::read()
    @          0x4578b91 starrocks::ScalarColumnIterator::next_batch()
    @          0x4168c73 starrocks::vectorized::SegmentIterator::_read()
    @          0x415f48c starrocks::vectorized::SegmentIterator::_do_get_next()
    @          0x4162a80 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x41e3226 starrocks::vectorized::MaskMergeIterator::do_get_next()
    @          0x423e01c starrocks::vectorized::RowsetMergerImpl<>::_do_merge_vertically()
    @          0x423f964 starrocks::vectorized::RowsetMergerImpl<>::do_merge()
    @          0x42334b7 starrocks::vectorized::compaction_merge_rowsets()
    @          0x40e9ee6 starrocks::TabletUpdates::_do_compaction()
    @          0x40eb370 starrocks::TabletUpdates::compaction()
    @          0x4047e73 starrocks::StorageEngine::_perform_update_compaction()
    @          0x42b5dbe starrocks::StorageEngine::_update_compaction_thread_callback()
    @          0x7e7a240 execute_native_thread_routine
    @     0x7f011642344b start_thread
    @     0x7f0115a1252f __GI___clone
    @                0x0 (unknown)
  1. The same tablet of a Primary Key table is compacted repeatedly, causing high I/O usage

iotop shows update_apply consuming a large share of disk I/O.

I0614 14:30:23.446010 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:30:23.446035 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.7 score:-2513868 pick:1/valid:5/all:6 65 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:30:24.119956 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:30:24.119973 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.8 score:-2513868 pick:1/valid:5/all:6 66 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:30:24.781566 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:30:24.781575 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.9 score:-2513868 pick:1/valid:5/all:6 67 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:30:25.454936 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:30:25.454952 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.10 score:-2513868 pick:1/valid:5/all:6 68 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:30:26.165084 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:30:26.165097 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.11 score:-2513868 pick:1/valid:5/all:6 69 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:31:26.828867 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:31:26.828908 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:21 score:-2513868 pick:1/valid:5/all:7 70 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:32:27.552858 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171288
I0614 14:32:27.552901 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:21.1 score:-2513866 pick:1/valid:5/all:7 114 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:33:28.254227 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:33:28.254269 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:21.2 score:-2513868 pick:1/valid:5/all:7 115 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:33:28.971200 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:33:28.971218 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:21.3 score:-2513868 pick:1/valid:5/all:7 116 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:33:29.658658 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
I0614 14:33:29.658684 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:21.4 score:-2513868 pick:1/valid:5/all:7 117 #rows:8192000->8192000 bytes:2.40 MB->2.40 MB(estimate)
I0614 14:33:30.342273 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286
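To confirm that one tablet is being picked again and again, the "update compaction start" lines can be counted per tablet. This is a rough sketch that relies only on the log format shown above; the function name `count_update_compactions` and the sample list are illustrative, and in practice the lines would come from the BE INFO log.

```python
import re
from collections import Counter

# Matches the tablet id in lines such as:
#   "... update compaction start tablet:30054 version:20.7 ..."
COMPACTION_START = re.compile(r"update compaction start tablet:(\d+)")


def count_update_compactions(lines):
    """Count 'update compaction start' events per tablet id."""
    counts = Counter()
    for line in lines:
        match = COMPACTION_START.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two line pairs taken from the log excerpt above (tails trimmed).
SAMPLE_LINES = [
    "I0614 14:30:23.446010 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286",
    "I0614 14:30:23.446035 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.7 score:-2513868",
    "I0614 14:30:24.119956 129903 tablet_manager.cpp:662] Found the best tablet to compact. compaction_type=update tablet_id=30054 highest_score=89171286",
    "I0614 14:30:24.119973 129903 tablet_updates.cpp:2157] update compaction start tablet:30054 version:20.8 score:-2513868",
]

if __name__ == "__main__":
    for tablet_id, n in count_update_compactions(SAMPLE_LINES).most_common(10):
        print(f"tablet {tablet_id}: {n} update compactions")
```

A single tablet dominating the count over a short time window, with the row count unchanged before and after (`#rows:8192000->8192000` in the excerpt), matches the symptom described in this item.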
  1. _delete_tablets_on_unused_root_path crash

*** SIGABRT (@0x3e800004eaf) received by PID 20143 (TID 0x7f4fb631b700) from PID 20143; stack trace: ***
    @          0x58f9dc2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f509d0ae630 (unknown)
    @     0x7f509c5f9387 __GI_raise
    @     0x7f509c5faa78 __GI_abort
    @          0x2c5c79e starrocks::failure_function()
    @          0x58ed79d google::LogMessage::Fail()
    @          0x58efc0f google::LogMessage::SendToLog()
    @          0x58ed2ee google::LogMessage::Flush()
    @          0x58f0219 google::LogMessageFatal::~LogMessageFatal()
    @          0x3fefc29 starrocks::StorageEngine::_delete_tablets_on_unused_root_path()
    @          0x3fefd4a starrocks::StorageEngine::_start_disk_stat_monitor()
    @          0x424a3b1 starrocks::StorageEngine::_disk_stat_monitor_thread_callback()
    @          0x7e0a7e0 execute_native_thread_routine
    @     0x7f509d0a6ea5 start_thread
    @     0x7f509c6c1b0d __clone
    @                0x0 (unknown)
F0802 11:18:09.379490  3870 storage_engine.cpp:496] meet too many error disks, process exit. max_ratio_allowed=0%, error_disk_count=1, total_disk_count=1
F0802 11:43:16.782750 20644 storage_engine.cpp:496] meet too many error disks, process exit. max_ratio_allowed=0%, error_disk_count=1, total_disk_count=1
F0802 12:16:51.805621 21631 storage_engine.cpp:496] meet too many error disks, process exit. max_ratio_allowed=0%, error_disk_count=1, total_disk_count=1

Cause: the disk has failed or is full.
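Both causes can be checked from the host before restarting BE. The sketch below is only an illustration: it does not reproduce BE's own disk health check, the `min_free_ratio` threshold is an arbitrary example value rather than a StarRocks setting, and the path passed in should be whatever storage directory the BE is actually configured with.

```python
import os
import shutil


def check_storage_path(path, min_free_ratio=0.05):
    """Report whether `path` is usable and has at least `min_free_ratio` free space.

    `min_free_ratio` is an illustrative threshold, not a StarRocks setting.
    """
    if not os.path.isdir(path):
        return {"path": path, "ok": False, "reason": "missing or not a directory"}
    if not os.access(path, os.R_OK | os.W_OK):
        return {"path": path, "ok": False, "reason": "not readable/writable"}
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    if free_ratio < min_free_ratio:
        return {"path": path, "ok": False, "reason": f"only {free_ratio:.1%} free"}
    return {"path": path, "ok": True, "reason": f"{free_ratio:.1%} free"}


if __name__ == "__main__":
    # "/" is a placeholder; pass the BE storage directory instead.
    print(check_storage_path("/"))
```

A failing disk may still pass this check, so kernel and SMART logs are worth reading as well when free space looks fine.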

  1. FE upgrade fails on version 3.0

Occurs when upgrading from 3.0.0 ~ 3.0.2 to 3.0.3+ or 3.1.

2023-08-16 23:26:55,710 ERROR (stateChangeExecutor|73) [GlobalStateMgr.transferToLeader():1148] failed to init journal after transfer to leader! will exit
com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 91 path $.p.m2.
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:226) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:41) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:186) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:145) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
at com.google.gson.Gson.fromJson(Gson.java:963) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.fromJson(Gson.java:928) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.fromJson(Gson.java:877) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.fromJson(Gson.java:848) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.UserPrivilegeCollectionInfo.read(UserPrivilegeCollectionInfo.java:75) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalEntity.readFields(JournalEntity.java:990) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJournalCursor.deserializeData(BDBJournalCursor.java:251) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJournalCursor.next(BDBJournalCursor.java:295) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2144) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2104) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1143) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:325) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:722) ~[starrocks-fe.jar:?]
at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 91 path $.p.m2.
at com.google.gson.stream.JsonReader.beginObject(JsonReader.java:384) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:215) ~[spark-dpp-1.0.0.jar:?]
  1. CloudCanal reports: Unsupported dataFormat value is : \N

com.clougence.cloudcanal.base.metadata.exception.DataTaskRuntimeException: start increment service failed for task(id:27,name:canalp00lphy9198_INCREMENT),msgUnsupportedOperationException: Unsupported dataFormat value is : \N
        at com.clougence.cloudcanal.task.service.impl.CanalIncrementServiceImpl.start(CanalIncrementServiceImpl.java:86)
        at com.clougence.cloudcanal.task.DataTaskStarter.startDataTask(DataTaskStarter.java:247)
        at com.clougence.cloudcanal.task.DataTaskStarter.start(DataTaskStarter.java:109)
        at com.clougence.cloudcanal.task.TaskCoreApplication.main(TaskCoreApplication.java:57)
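  1. [Behavior change] Whether data goes to trash after TRUNCATE

Old versions: the data is moved to trash and deleted after it expires.

New versions: the data is not moved to trash; it is deleted immediately.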

  1. Inconsistent low-cardinality dictionary causes BE crash

query_id:24ce896e-52a9-11ee-958a-52540062cc3e, fragment_instance:24ce896e-52a9-11ee-958a-52540062cd2e
*** Aborted at 1694659692 (unix time) try "date -d @1694659692" if you are using GNU date ***
PC: @          0x21e0c13 starrocks::vectorized::FixedLengthColumnBase<>::swap_column()
*** SIGSEGV (@0x52eeb18) received by PID 2778746 (TID 0x7fb02c7dc700) from PID 86960920; stack trace: ***
    @          0x4ebcb82 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fb38adbfce0 (unknown)
    @          0x21e0c13 starrocks::vectorized::FixedLengthColumnBase<>::swap_column()
    @          0x37076be starrocks::vectorized::NullableColumn::swap_column()
    @          0x36c7cc5 starrocks::vectorized::SegmentIterator::_encode_to_global_id()
    @          0x36d062b starrocks::vectorized::SegmentIterator::_do_get_next()
    @          0x36d32c0 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x37475d2 starrocks::vectorized::ProjectionIterator::do_get_next()
    @          0x3cf4d35 starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x3b134a3 starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x376e8ce starrocks::vectorized::TabletReader::do_get_next()
    @          0x254ff5b starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @          0x255063b starrocks::pipeline::OlapChunkSource::_read_chunk()
    @          0x254002c starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
    @          0x22be4c4 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
    @          0x22cf57e starrocks::workgroup::ScanExecutor::worker_thread()
    @          0x3ef1362 starrocks::ThreadPool::dispatch_thread()
    @          0x3eebe5a starrocks::Thread::supervise_thread()
    @     0x7fb38adb51ca start_thread
    @     0x7fb38aae4ef3 __GI___clone
    @                0x0 (unknown)
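  1. Spark connector fails when reading StarRocks data: Set cancelled by MemoryScratchSinkOperator

The BE log reports an error similar to a memory-limit-exceeded error:

Set cancelled by MemoryScratchSinkOperator

Alternatively, the memory statistic for process is smaller than the memory statistic for query_pool.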

  1. Hot/cold data migration causes an FE deadlock

"BackgroundDynamicPartitionThread" #97631 daemon prio=5 os_prio=0 cpu=0.46ms elapsed=30497.58s tid=0x00007f7f0d0f8800 nid=0x4a4e waiting on condition  [0x00007f7ce498b000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.0.1/Native Method)
        - parking to wait for  <0x0000000506370ab8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.0.1/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.0.1/ReentrantReadWriteLock.java:959)
        at com.starrocks.catalog.TabletInvertedIndex.writeLock(TabletInvertedIndex.java:118)
        at com.starrocks.catalog.TabletInvertedIndex.deleteTablet(TabletInvertedIndex.java:682)
        at com.starrocks.server.LocalMetastore.onErasePartition(LocalMetastore.java:4567)
        at com.starrocks.server.GlobalStateMgr.onErasePartition(GlobalStateMgr.java:4001)
        at com.starrocks.catalog.OlapTable.dropPartition(OlapTable.java:976)
        at com.starrocks.catalog.OlapTable.dropPartition(OlapTable.java:1002)
        at com.starrocks.server.LocalMetastore.dropPartition(LocalMetastore.java:1551)
        at com.starrocks.server.GlobalStateMgr.dropPartition(GlobalStateMgr.java:2434)
        at com.starrocks.clone.DynamicPartitionScheduler.executeDynamicPartitionForTable(DynamicPartitionScheduler.java:412)
        at com.starrocks.catalog.OlapTable.lambda$onCreate$3(OlapTable.java:2450)
        at com.starrocks.catalog.OlapTable$Lambda$3278/0x000000080187c840.run(Unknown Source)
        at java.lang.Thread.run(java.base@11.0.0.1/Thread.java:834)

   Locked ownable synchronizers:
        - <0x0000000753f065a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        
"ReportHandler" #113 daemon prio=5 os_prio=0 cpu=84177.82ms elapsed=42962.98s tid=0x00007f7f89ed1000 nid=0xa3b waiting on condition  [0x00007f7f0abee000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.0.1/Native Method)
        - parking to wait for  <0x0000000753f065a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.0.1/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:1009)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.0.1/AbstractQueuedSynchronizer.java:1324)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.0.1/ReentrantReadWriteLock.java:738)
        at com.starrocks.catalog.Database.readLock(Database.java:182)
        at com.starrocks.catalog.TabletInvertedIndex.addToTabletMigrationMap(TabletInvertedIndex.java:334)
        at com.starrocks.catalog.TabletInvertedIndex.tabletReport(TabletInvertedIndex.java:213)
        at com.starrocks.leader.ReportHandler.tabletReport(ReportHandler.java:394)
        at com.starrocks.leader.ReportHandler.access$300(ReportHandler.java:119)
        at com.starrocks.leader.ReportHandler$ReportTask.exec(ReportHandler.java:351)
        at com.starrocks.leader.ReportHandler.runOneCycle(ReportHandler.java:1459)
        at com.starrocks.common.util.Daemon.run(Daemon.java:115)       
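In a long jstack dump it helps to extract who waits on which lock automatically. The sketch below builds a simple "blocked by" map from the `parking to wait for` and `Locked ownable synchronizers` lines. It is a rough aid, not a full deadlock detector: the names `parse_thread_dump` and `blocked_by` are made up here, and shared read locks are, as far as I know, not listed under "Locked ownable synchronizers", so the resulting map can be incomplete.

```python
import re

THREAD_HEADER = re.compile(r'^"([^"]+)"')
WAITING_FOR = re.compile(r"parking to wait for\s+<(0x[0-9a-fA-F]+)>")
OWNED_LOCK = re.compile(r"^\s*-\s+<(0x[0-9a-fA-F]+)>")


def parse_thread_dump(text):
    """Return (waits_for, holds): thread -> lock address, thread -> set of addresses."""
    waits_for, holds = {}, {}
    current, in_owned_section = None, False
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread block starts; reset per-thread state.
            current, in_owned_section = header.group(1), False
            holds[current] = set()
            continue
        if current is None:
            continue
        if "Locked ownable synchronizers:" in line:
            in_owned_section = True
            continue
        waiting = WAITING_FOR.search(line)
        if waiting:
            waits_for[current] = waiting.group(1)
        elif in_owned_section:
            owned = OWNED_LOCK.match(line)
            if owned:
                holds[current].add(owned.group(1))
    return waits_for, holds


def blocked_by(waits_for, holds):
    """Map each waiting thread to the threads that own the lock it waits for."""
    result = {}
    for waiter, address in waits_for.items():
        owners = [t for t, owned in holds.items() if address in owned and t != waiter]
        if owners:
            result[waiter] = owners
    return result


# Trimmed excerpt of the dump above (stack frames omitted).
SAMPLE = '''\
"BackgroundDynamicPartitionThread" #97631 daemon prio=5
        - parking to wait for  <0x0000000506370ab8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

   Locked ownable synchronizers:
        - <0x0000000753f065a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

"ReportHandler" #113 daemon prio=5
        - parking to wait for  <0x0000000753f065a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
'''

if __name__ == "__main__":
    waits, owned = parse_thread_dump(SAMPLE)
    for waiter, owners in blocked_by(waits, owned).items():
        print(f"{waiter} is blocked by {', '.join(owners)}")
```

On the excerpt, this reports ReportHandler as blocked by BackgroundDynamicPartitionThread, which holds `0x0000000753f065a0`. The owner of `0x0000000506370ab8` does not appear in the excerpt, so the other half of the cycle has to be read from the stack frames (`TabletInvertedIndex.writeLock` versus `TabletInvertedIndex.tabletReport`).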
  1. Hot/cold data migration causes BE to use a large amount of memory

  1. BE GlobalRuntimeFilter memory leak

  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.2.0 ~ 2.2.15
    • 2.3.0 ~ 2.3.16
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.10
    • 3.0.0 ~ 3.0.5
    • 3.1.0 ~ 3.1.1
  • Fixed versions:

    • 2.2.16+
    • 2.3.17+
    • 2.4.6+
    • 2.5.11+
    • 3.0.6+
    • 3.1.2+
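When planning an upgrade, the fixed-version list above can be read as "first fixed release per branch". The following is a small sketch of that lookup; `FIXED_IN` and `minimum_fixed_version` are illustrative names, and the function assumes plain `major.minor.patch` version strings.

```python
# First fixed release per branch, copied from the fixed-version list above.
FIXED_IN = {
    (2, 2): (2, 2, 16),
    (2, 3): (2, 3, 17),
    (2, 4): (2, 4, 6),
    (2, 5): (2, 5, 11),
    (3, 0): (3, 0, 6),
    (3, 1): (3, 1, 2),
}


def minimum_fixed_version(version):
    """Return the first fixed release on the same branch.

    Returns None if the version already contains the fix or if its branch
    is not covered by the list.
    """
    major, minor, patch = (int(p) for p in version.strip().split("."))
    fixed = FIXED_IN.get((major, minor))
    if fixed is None or (major, minor, patch) >= fixed:
        return None
    return ".".join(str(p) for p in fixed)


if __name__ == "__main__":
    for current in ("2.2.15", "2.5.3", "3.1.2"):
        print(current, "->", minimum_fixed_version(current))
```

Branches missing from the table return None rather than a guess, because the list says nothing about them.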
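  1. BE regex_replace function memory leak

  1. BE memory leak with ES (Elasticsearch) external tables

  1. BE memory leak when loading Avro-format data

  1. FE fails to start in shared-data (storage-compute separation) mode: failed to load journal type 118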

2023-08-16 09:11:47,262 WARN (leaderCheckpointer|130) [GlobalStateMgr.replayJournalInner():2012] catch exception when replaying 9748222,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 118
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:981) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2001) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1953) [starrocks-fe.jar:?]
        at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:215) [starrocks-fe.jar:?]
        at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:106) [starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:73) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.lake.StarOSAgent.getServiceId(StarOSAgent.java:101) ~[starrocks-fe.jar:?]
        at com.starrocks.lake.StarOSAgent.prepare(StarOSAgent.java:94) ~[starrocks-fe.jar:?]
        at com.starrocks.lake.StarOSAgent.getShardReplicas(StarOSAgent.java:393) ~[starrocks-fe.jar:?]
        at com.starrocks.lake.StarOSAgent.getBackendIdsByShard(StarOSAgent.java:444) ~[starrocks-fe.jar:?]
        at com.starrocks.lake.LakeTablet.getBackendIds(LakeTablet.java:88) ~[starrocks-fe.jar:?]
        at com.starrocks.server.LocalMetastore.truncateTableInternal(LocalMetastore.java:4833) ~[starrocks-fe.jar:?]
        at com.starrocks.server.LocalMetastore.replayTruncateTable(LocalMetastore.java:4862) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayTruncateTable(GlobalStateMgr.java:3520) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:574) ~[starrocks-fe.jar:?]