Common Crash / Bug / Optimization Lookup

  1. FE fails to start: failed to load journal type 12110

2023-08-09 14:07:57,985 INFO (stateChangeExecutor|66) [DatabaseTransactionMgr.replayUpsertTransactionState():1626] replay a committed transaction TransactionState. txn_id: 5242, label: insert_cf587b48-3675-11ee-8c4a-00163e1276cf, db id: 91465, table id list: 91471, callback id: -1, coordinator: FE: 172.26.80.21, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1691559011286, commit time: 1691559011331, finish time: -1, write cost: 45ms, reason:  attachment: com.starrocks.transaction.InsertTxnCommitAttachment@3bd990c2
2023-08-09 14:07:57,985 WARN (stateChangeExecutor|66) [GlobalStateMgr.replayJournalInner():2301] catch exception when replaying 26401,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 12110
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1090) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2290) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2242) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1216) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:338) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:771) ~[starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.transaction.TransactionLogApplierFactory.create(TransactionLogApplierFactory.java:23) ~[starrocks-fe.jar:?]
        at com.starrocks.transaction.DatabaseTransactionMgr.updateCatalogAfterCommitted(DatabaseTransactionMgr.java:1526) ~[starrocks-fe.jar:?]
        at com.starrocks.transaction.DatabaseTransactionMgr.replayUpsertTransactionState(DatabaseTransactionMgr.java:1627) ~[starrocks-fe.jar:?]
        at com.starrocks.transaction.GlobalTransactionMgr.replayUpsertTransactionState(GlobalTransactionMgr.java:674) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:599) ~[starrocks-fe.jar:?]
        ... 7 more
  • Github Issue:

  • Github Fix PR:

  • Jira

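  • Affected versions:

    • 3.0.0 ~ 3.0.5
  • Fixed versions:

    • 3.0.6+
  • Root cause:

    • Two tables with the same name are created concurrently. While the creation is in progress, the database is dropped and a new database with the same name is created. When a table is added to the database, the check that the database exists is done by name, so both tables are created successfully, but only one of them can be replayed from the journal.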
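
The race described above can be illustrated with a toy model. The sketch below is not StarRocks code: `Catalog`, `register_table_by_name`, and `register_table_by_id` are hypothetical names chosen for illustration. It only shows why a name-based existence check accepts a table whose database was dropped and recreated in the meantime, whereas an ID-based check would reject it. The notes do not say how the actual fix was implemented, so the ID-based variant is an assumption about one possible guard.

```python
import itertools


class Catalog:
    """Toy catalog: maps database name -> (db_id, set of table names)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.dbs = {}

    def create_db(self, name):
        db_id = next(self._ids)
        self.dbs[name] = (db_id, set())
        return db_id

    def drop_db(self, name):
        del self.dbs[name]

    def register_table_by_name(self, db_name, table):
        # Name-based check: passes even if the database was dropped and recreated.
        if db_name not in self.dbs:
            return False
        self.dbs[db_name][1].add(table)
        return True

    def register_table_by_id(self, db_name, expected_db_id, table):
        # ID-based check: fails if the name now points to a different database.
        entry = self.dbs.get(db_name)
        if entry is None or entry[0] != expected_db_id:
            return False
        entry[1].add(table)
        return True


catalog = Catalog()
old_id = catalog.create_db("db1")   # table creation starts against this database
catalog.drop_db("db1")              # database dropped while creation is in flight
new_id = catalog.create_db("db1")   # same name, different ID

by_name = catalog.register_table_by_name("db1", "t1")
by_id = catalog.register_table_by_id("db1", old_id, "t2")
print(by_name, by_id)
```

In this model the name-based registration succeeds against the recreated database, which is the kind of state that cannot be replayed consistently later.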
  1. FE fails to start: Expected BEGIN_OBJECT but was STRING

2023-08-16 23:26:55,710 ERROR (stateChangeExecutor|73) [GlobalStateMgr.transferToLeader():1148] failed to init journal after transfer to leader! will exit
com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 91 path $.p.m2.
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:226) ~[spark-dpp-1.0.0.jar:?]
  1. FE fails to start: Column cannot be resolved (SemanticException)

2023-08-16 16:49:55,364 ERROR (UNKNOWN 10.18.104.101_9010_1681212541567(-1)|1) [StarRocksFE.start():170] StarRocksFE start failed
com.starrocks.sql.analyzer.SemanticException: Column '`usr_ser`.`v_dwd_usr_ser_spo_order_qty_dtl`.`execute_date`' cannot be resolved
        at com.starrocks.sql.analyzer.Scope.resolveField(Scope.java:83) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.Scope.resolveField(Scope.java:77) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.ExpressionAnalyzer$Visitor.visitSlot(ExpressionAnalyzer.java:253) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.ExpressionAnalyzer$Visitor.visitSlot(ExpressionAnalyzer.java:210) ~[starrocks-fe.jar:?]
        at com.starrocks.analysis.SlotRef.accept(SlotRef.java:489) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:41) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.ExpressionAnalyzer.bottomUpAnalyze(ExpressionAnalyzer.java:207) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.ExpressionAnalyzer.analyze(ExpressionAnalyzer.java:102) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.ExpressionAnalyzer.analyzeExpression(ExpressionAnalyzer.java:1194) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.analyzePartitionInfo(MaterializedView.java:734) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.onCreate(MaterializedView.java:700) ~[starrocks-fe.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_232]
        at com.starrocks.server.LocalMetastore.loadDb(LocalMetastore.java:326) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.loadImage(GlobalStateMgr.java:1301) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.initialize(GlobalStateMgr.java:955) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.start(StarRocksFE.java:116) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.main(StarRocksFE.java:68) ~[starrocks-fe.
  • Github Issue:

  • Github Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.10
    • 3.0.0 ~ 3.0.5
    • 3.1.0 ~ 3.1.1
  • Fixed versions:

    • 2.5.11+
    • 3.0.6+
    • 3.1.2+
  • Root cause:

    • The columns of a base table changed while the materialized view was being loaded.
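
To check quickly whether a cluster falls into the affected ranges listed for this entry, a small helper along these lines may be handy. It is only a sketch: the ranges in `AFFECTED` are copied from the entry above, the function names are hypothetical, and it assumes plain `x.y.z` version strings (no suffixes such as `-rc01`).

```python
# Inclusive (low, high) version ranges copied from the entry above.
AFFECTED = [
    ((2, 5, 0), (2, 5, 10)),
    ((3, 0, 0), (3, 0, 5)),
    ((3, 1, 0), (3, 1, 1)),
]


def parse_version(text):
    """Turn '2.5.8' into (2, 5, 8) so tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version, ranges=AFFECTED):
    """Return True if the version lies inside any inclusive range."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in ranges)


print(is_affected("2.5.8"))
print(is_affected("3.0.6"))
```

Comparing integer tuples avoids the string-comparison trap where "2.5.10" would sort before "2.5.4".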
  1. FE fails to start: failed to load journal type 10002

2023-07-14 14:31:56,161 WARN (stateChangeExecutor|70) [GlobalStateMgr.replayJournalInner():1914] catch exception when replaying 34812512,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 10002
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:948) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1903) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1854) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1034) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:295) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:643) [starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:86) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.server.LocalMetastore.replayAddPartition(LocalMetastore.java:1442) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayAddPartition(GlobalStateMgr.java:2057) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:216) ~[starrocks-fe.jar:?]
        ... 7 more
  • Github Issue:

  • Github Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.4
  • Fixed versions:

    • 2.5.5+
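  • Root cause:

    • A materialized view was created on information_schema. Tables in information_schema are not persisted, so replaying the journal entries related to that materialized view throws an NPE.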
  1. FE fails to start: failed to load journal type 17

2023-04-06 13:22:13,024 WARN (stateChangeExecutor|79) [GlobalStateMgr.replayJournal():1941] got interrupt exception or inconsistent exception when replay journal 30315695, will exit,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 17
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1031) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1987) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1939) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1097) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:316) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:690) [starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.catalog.CatalogRecycleBin.replayRecoverTable(CatalogRecycleBin.java:579) ~[starrocks-fe.jar:?]
        at com.starrocks.server.LocalMetastore.replayRecoverTable(LocalMetastore.java:2272) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayRecoverTable(GlobalStateMgr.java:2714) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:346) ~[starrocks-fe.jar:?]
  1. FE fails to start: com.google.gson.JsonSyntaxException: duplicate key

A failed materialized view activation can also produce this stack trace.

com.google.gson.JsonSyntaxException: duplicate key: 7417179
        at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:190) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:145) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.read(GsonUtils.java:641) ~[starrocks-fe.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:963) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:928) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:877) ~[spark-dpp-1.0.0.jar:?]
        at com.google.gson.Gson.fromJson(Gson.java:848) ~[spark-dpp-1.0.0.jar:?]
  • Github Issue:

  • Github Fix PR:

  • Jira

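  • Affected versions:

    • 2.5.0 ~ 2.5.10
    • 3.0.0 ~ 3.0.5
    • 3.1.0 ~ 3.1.1
  • Fixed versions:

    • 2.5.11+
    • 3.0.6+
    • 3.1.2+
  • Root cause:

    • An object in the materialized view refresh log was updated while it was being serialized, so two identical keys were written out.
  • Solution:

    • Upgrade to the latest patch release, then drop the materialized view and recreate it.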
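
The stack trace for this entry shows Gson's map adapter rejecting a repeated map key. The sketch below reproduces the same idea in Python to show what such a payload looks like and how a strict reader detects it. It is an analogy, not the FE code path: the payload is made up (only the key `7417179` comes from the log above), and `reject_duplicate_keys` and `strict_loads` are hypothetical helper names. Python's `json.loads` silently keeps the last value by default, so the strictness has to be added through `object_pairs_hook`.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises instead of silently overwriting a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen[key] = value
    return seen


def strict_loads(text):
    """Parse JSON, failing on any object that repeats a key."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


# Made-up payload with the same map key serialized twice.
payload = '{"7417179": {"v": 1}, "7417179": {"v": 2}}'

try:
    strict_loads(payload)
except ValueError as err:
    print(err)
```

A check like this can be run over an exported JSON fragment to find which key is repeated before deciding which materialized view to drop and recreate.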
  1. FE fails to start: java.lang.IllegalArgumentException: capacity < 0

2023-08-21 21:48:10,983 ERROR (UNKNOWN 10.8.1.81_9010_1678173058506(-1)|1) [StarRocksFE.start():170] StarRocksFE start failed
java.lang.IllegalArgumentException: capacity < 0: (-2038667263 < 0)
        at java.nio.Buffer.createCapacityException(Buffer.java:256) ~[?:?]
        at java.nio.CharBuffer.allocate(CharBuffer.java:347) ~[?:?]
        at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:807) ~[?:?]
        at com.starrocks.common.io.Text.decode(Text.java:342) ~[starrocks-fe.jar:?]
        at com.starrocks.common.io.Text.decode(Text.java:321) ~[starrocks-fe.jar:?]
        at com.starrocks.common.io.Text.readString(Text.java:396) ~[starrocks-fe.jar:?]
        at com.starrocks.scheduler.TaskManager.loadTasks(TaskManager.java:518) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.loadImage(GlobalStateMgr.java:1331) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.initialize(GlobalStateMgr.java:955) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.start(StarRocksFE.java:116) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.main(StarRocksFE.java:68) ~[starrocks-fe.jar:?]
  1. FE fails to start: failed to load journal type 10097

2023-08-26 12:07:26,524 WARN (stateChangeExecutor|91) [GlobalStateMgr.replayJournal():1883] got interrupt exception or inconsistent exception when replay journal 3070926, will exit,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 10097
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:954) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1930) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1881) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1050) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:298) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:655) ~[starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:86) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: com.starrocks.sql.analyzer.SemanticException: Create materialized view from inactive materialized view: cyb_dwd_catarc_mix_merge_mon_v
        at com.starrocks.sql.analyzer.MaterializedViewAnalyzer.lambda$getBaseTableInfos$0(MaterializedViewAnalyzer.java:114) ~[starrocks-fe.jar:?]
        at java.util.HashMap.forEach(HashMap.java:1288) ~[?:1.8.0_131]
        at com.starrocks.sql.analyzer.MaterializedViewAnalyzer.getBaseTableInfos(MaterializedViewAnalyzer.java:106) ~[starrocks-fe.jar:?]
        at com.starrocks.alter.Alter.processChangeMaterializedViewStatus(Alter.java:336) ~[starrocks-fe.jar:?]
        at com.starrocks.alter.Alter.replayAlterMaterializedViewStatus(Alter.java:574) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayAlterMaterializedViewStatus(GlobalStateMgr.java:3099) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:294) ~[starrocks-fe.jar:?]
  1. FE fails to start: failed to load journal type 10096

2023-02-21 23:20:10,217 WARN (replayer|69) [GlobalStateMgr.replayJournalInner():1723] catch exception when replaying 37314317,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 10096
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:954) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1712) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:1571) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:1636) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.load.InsertOverwriteJobRunner.<init>(InsertOverwriteJobRunner.java:67) ~[starrocks-fe.jar:?]
        at com.starrocks.load.InsertOverwriteJobManager.replayInsertOverwriteStateChange(InsertOverwriteJobManager.java:142) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:927) ~[starrocks-fe.jar:?]
        ... 4 more
  1. FE fails to start: audit plugin fails to load

 FE start failed: java.lang.NoClassDefFoundError: com/starrocks/plugin/audit/AuditLoaderPlugin$AuditLoaderConf
  1. FE fails to start: failed to load journal type 10081

2023-09-16 12:02:04,536 WARN (replayer|66) [GlobalStateMgr.replayJournalInner():1963] catch exception when replaying 87,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 10081
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:967) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1952) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:1809) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:1874) ~[starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
        at com.starrocks.scheduler.TaskRunBuilder.build(TaskRunBuilder.java:37) ~[starrocks-fe.jar:?]
        at com.starrocks.scheduler.TaskManager.replayCreateTaskRun(TaskManager.java:597) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:669) ~[starrocks-fe.jar:?]
        ... 4 more
  • Github Issue:

  • Github Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.12
    • 3.0.0 ~ 3.0.6
    • 3.1.0 ~ 3.1.3
  • Fixed versions:

    • 2.5.13+
    • 3.0.7+
    • 3.1.4+
  • Root cause:

    • A field was added to Task, and its default value after deserialization is null.
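
The failure mode described above (a newly added field that is absent from already-persisted records and therefore comes back as null) can be sketched generically. This is an illustration in Python, not the FE's `Task` class: the `source` field, the record layout, and the `"UNKNOWN"` fallback are all hypothetical. `AttributeError` on `None` plays the role of the Java `NullPointerException`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    # Hypothetical field added in a later version; older records do not carry it.
    source: Optional[str] = None


def task_from_record(record):
    """Deserialize a persisted record; a missing field becomes None."""
    return Task(name=record["name"], source=record.get("source"))


def build_task_run(task):
    # Unsafe: assumes the new field is always populated.
    return task.source.upper()


def build_task_run_safe(task, default_source="UNKNOWN"):
    # Defensive: fall back to a default when the persisted record lacked the field.
    source = task.source if task.source is not None else default_source
    return source.upper()


old_record = {"name": "mv-refresh-1"}   # written before the field existed
task = task_from_record(old_record)

try:
    build_task_run(task)
except AttributeError as err:
    print("unsafe build failed:", err)

print(build_task_run_safe(task))
```

Giving the new field a non-null default at deserialization time, or guarding every use, are the two usual ways to keep old records replayable.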
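  1. BE uses too much memory while loading metadata at startup

While BE is starting, loading metadata consumes a large amount of memory and startup takes a long time. Once BE has started successfully, memory usage drops back down.

  1. After a table is restored, its 3 replicas are inconsistent

Some replicas may be empty.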

  1. Window function crash

*** Aborted at 1694398352 (unix time) try "date -d @1694398352" if you are using GNU date ***
PC: @     0x7f5d104cff83 __memmove_avx_unaligned_erms
*** SIGSEGV (@0x7f5ce63ff000) received by PID 221675 (TID 0x7f5a8d70c700) from PID 18446744073277534208; stack trace: ***
    @          0x5b1ba42 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f5d11412ce0 (unknown)
    @     0x7f5d104cff83 __memmove_avx_unaligned_erms
    @          0x2d41ce4 starrocks::vectorized::FixedLengthColumnBase<>::remove_first_n_values()
    @          0x3071e03 (unknown)
    @          0x3077bf0 starrocks::pipeline::LocalPartitionTopnContext::push_one_chunk_to_partitioner()
    @          0x3052899 starrocks::pipeline::LocalPartitionTopnSinkOperator::push_chunk()
    @          0x2d7bace starrocks::pipeline::PipelineDriver::process()
    @          0x51333fa starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x4b17352 starrocks::ThreadPool::dispatch_thread()
    @          0x4b11e4a starrocks::Thread::supervise_thread()
    @     0x7f5d114081cf start_thread
    @     0x7f5d10439d83 __GI___clone
    @                0x0 (unknown)
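
The `Aborted at ... (unix time)` header in BE crash logs carries the crash time as epoch seconds. Where GNU `date` is not available, the value can be converted with a few lines of Python, which helps when lining the crash up with FE logs or query audit entries. The function name `crash_time_utc` is just a placeholder for this sketch.

```python
from datetime import datetime, timezone


def crash_time_utc(unix_seconds):
    """Convert the epoch seconds from an 'Aborted at' line to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


# Value taken from the 'Aborted at' line of the window function crash above.
print(crash_time_utc(1694398352))
```

The result is in UTC; convert to the cluster's local time zone before comparing it with log timestamps.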
  1. Date_diff function crash

*** SIGSEGV (@0x0) received by PID 135714 (TID 0x7fdd8bbfd700) from PID 0; stack trace: ***
    @          0x6e57cb2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fdee909c630 (unknown)
    @          0x5223dd2 starrocks::TimeFunctions::datediff()
    @          0x50c2de4 starrocks::VectorizedFunctionCallExpr::evaluate_checked()
    @          0x482e1d3 starrocks::ExprContext::evaluate()
    @          0x482e51f starrocks::ExprContext::evaluate()
    @          0x3e2bed4 starrocks::pipeline::ProjectOperator::push_chunk()
    @          0x3eea3e4 starrocks::pipeline::PipelineDriver::process()
    @          0x3ed892e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x5fe477a starrocks::ThreadPool::dispatch_thread()
    @          0x5fdeaca starrocks::Thread::supervise_thread()
    @     0x7fdee9094ea5 start_thread
    @     0x7fdee86afb0d __clone
    @                0x0 (unknown)
  1. Random crash caused by column reuse in UNION

PC: @          0x1a06303 starrocks::TabletMeta::max_version()
*** SIGSEGV (@0x8) received by PID 126209 (TID 0x7f6172bc0700) from PID 8; stack trace: ***
    @          0x3db9592 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f62944dd5d0 (unknown)
    @          0x1a06303 starrocks::TabletMeta::max_version()
    @          0x19e241a starrocks::Tablet::rowset_with_max_version()
    @          0x19e5073 starrocks::Tablet::can_do_compaction()
    @          0x19f7898 starrocks::TabletManager::find_best_tablet_to_compaction()
    @          0x19d2b35 starrocks::StorageEngine::_perform_cumulative_compaction()
    @          0x19bf09d starrocks::StorageEngine::_cumulative_compaction_thread_callback()
    @          0x57f1810 execute_native_thread_routine
    @     0x7f62944d5dd5 start_thread
    @     0x7f6293af0ead __clone
    @                0x0 (unknown)
  1. BE stream load deadlock

Run admin execute on 10004 'System.print(ExecEnv.get_stack_trace_for_all_threads())'; and look for the typical stacks below (10004 is the BE ID, which show backends displays):

48 tids: 495840,495841,495842,495843,495844,495845,495846,495847,495848,495849,495850,495851,495852,495853,495854,495855,495856,495857,495858,495859,495860,495861,495862,495863,495864,495865,495866,495867,495868,495869,495870,495871,495872,495873,495874,495875,495876,495877,495878,495879,495880,495881,495882,495883,495884,495885,495886,495887
    0x7fddf41edaf7  syscall
         0x91db983  std::__atomic_futex_unsigned_base::_M_futex_wait_until()
         0x531efa0  starrocks::StreamLoadAction::_handle()
         0x531f4c1  starrocks::StreamLoadAction::handle()
         0x5ef5277  evhttp_handle_request
         0x5ef5f23  bufferevent_readcb
         0x5ee2662  event_process_active_single_queue
         0x5ee2d9f  event_base_loop
         0x53058c4  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
         0x920b9d0  execute_native_thread_routine
    0x7fddf4455fa3  start_thread
    0x7fddf41f306f  clone
             (nil)  (unknown)
48 tids: 495214,495215,495216,495217,495218,495219,495220,495221,495222,495223,495224,495225,495226,495227,495228,495229,495230,495231,495232,495233,495234,495235,495236,495237,495238,495239,495240,495241,495242,495243,495244,495245,495246,495247,495248,495249,495250,495251,495252,495253,495254,495255,495256,495257,495258,495259,495260,495261
    0x7fddf445c00a  __pthread_cond_wait
         0x91a063c  std::condition_variable::wait()
         0x4cf9cd6  starrocks::StreamLoadPipe::read()
         0x335e25d  starrocks::vectorized::JsonReader::_read_and_parse_json()
         0x3362447  starrocks::vectorized::JsonScanner::_open_next_reader()
         0x3363cda  starrocks::vectorized::JsonScanner::get_next()
         0x53c76d1  starrocks::connector::FileDataSource::get_next()
         0x3460c45  starrocks::vectorized::ConnectorScanNode::_scanner_thread()
         0x4c7c2f0  starrocks::PriorityThreadPool::work_thread()
         0x5e29b67  thread_proxy
    0x7fddf4455fa3  start_thread
    0x7fddf41f306f  clone
             (nil)  (unknown)
120 tids: 495264,495265,495266,495267,495268,495269,495270,495271,495272,495273,495274,495275,495276,495277,495278,495279,495280,495281,495282,495283,495284,495285,495286,495287,495288,495289,495290,495291,495292,495293,495294,495295,495296,495297,495298,495299,495300,495301,495302,495303,495304,495305,495306,495307,495308,495309,495310,495311,495312,495313,495314,495315,495316,495317,495318,495319,495320,495321,495322,495323,495324,495325,495326,495327,2154605,2154606,2154607,2154608,2154610,2154671,2154964,2154965,2154966,2154967,2154969,2154984,2155017,2155039,2155069,2155070,2155080,2155081,2155082,2155083,2155084,2155085,2155086,2155087,2155088,2155113,2155125,2155128,2155130,2155131,2155132,2155133,2155148,2155153,2155154,2155335,2155336,2155349,2155354,2155365,2155404,2155417,2155432,2155451,2155461,2155462,2155485,2155486,2155503,2155509,2155521,2155576,2155639,2155712,2155713,2155733
    0x7fddf445c00a  __pthread_cond_wait
         0x91a063c  std::condition_variable::wait()
         0x3460520  starrocks::vectorized::ConnectorScanNode::get_next()
         0x4d4ad53  starrocks::PlanFragmentExecutor::_get_next_internal_vectorized()
         0x4d4b140  starrocks::PlanFragmentExecutor::_open_internal_vectorized()
         0x4d4d2dd  starrocks::PlanFragmentExecutor::open()
         0x4c9e71b  starrocks::FragmentExecState::execute()
         0x4ca4993  starrocks::FragmentMgr::exec_actual()
         0x4e43062  starrocks::ThreadPool::dispatch_thread()
         0x4e3db5a  starrocks::Thread::supervise_thread()
    0x7fddf4455fa3  start_thread
    0x7fddf41f306f  clone
             (nil)  (unknown)
  • Github Issue:

  • Github Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.12
    • 3.0.0 ~ 3.0.6
    • 3.1.0 ~ 3.1.3
  • Fixed versions:

    • 2.5.13+
    • 3.0.7+
    • 3.1.4+
  • Root cause:

  • Temporary workaround:

    • Edit be.conf and increase these two settings:

webserver_num_workers=128 (default: 48)
scanner_thread_pool_thread_num=128 (default: 48)
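
In the dump for this entry, 48 threads sit in `StreamLoadAction::_handle()` and another 48 in `StreamLoadPipe::read()`, which matches the default of 48 for both settings and is consistent with both pools being fully occupied. The notes do not state the root cause, so treat that reading as an observation only. The sketch below summarizes such a dump by thread count and first `starrocks::` frame so the pattern is easier to spot. It assumes the dump format shown above (`N tids: ...` followed by address/symbol lines); `summarize` is a hypothetical helper and the sample is abbreviated from the dump.

```python
import re

HEADER = re.compile(r"^(\d+) tids:")
FRAME = re.compile(r"^\s*(?:0x[0-9a-f]+|\(nil\))\s+(.+)$")


def summarize(dump, marker="starrocks::"):
    """Return (thread_count, first frame starting with marker) per stack group."""
    groups = []
    current = None
    for line in dump.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"threads": int(header.group(1)), "frame": None}
            groups.append(current)
            continue
        frame = FRAME.match(line)
        if frame and current is not None and current["frame"] is None:
            symbol = frame.group(1).strip()
            if symbol.startswith(marker):
                current["frame"] = symbol
    return [(g["threads"], g["frame"]) for g in groups]


# Abbreviated sample: tid lists and deeper frames are shortened for brevity.
SAMPLE = """\
48 tids: 495840,495841
    0x7fddf41edaf7  syscall
         0x91db983  std::__atomic_futex_unsigned_base::_M_futex_wait_until()
         0x531efa0  starrocks::StreamLoadAction::_handle()
48 tids: 495214,495215
    0x7fddf445c00a  __pthread_cond_wait
         0x91a063c  std::condition_variable::wait()
         0x4cf9cd6  starrocks::StreamLoadPipe::read()
"""

for threads, frame in summarize(SAMPLE):
    print(threads, frame)
```

If the counts for these two frames equal the configured `webserver_num_workers` and `scanner_thread_pool_thread_num`, the BE matches the pattern in this entry and the workaround above applies.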

  1. group_concat crash

*** Aborted at 1682471071 (unix time) try "date -d @1682471071" if you are using GNU date ***
PC: @          0x3534a66 starrocks::vectorized::GroupConcatAggregateFunction<>::finalize_to_column()
*** SIGSEGV (@0x7f6c017fd000) received by PID 3835823 (TID 0x7f7901aea700) from PID 25153536; stack trace: ***
    @          0x5824342 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f7a97ef44c0 (unknown)
    @          0x3534a66 starrocks::vectorized::GroupConcatAggregateFunction<>::finalize_to_column()
    @          0x3058a19 starrocks::Aggregator::_finalize_to_chunk()
    @          0x30aa4f6 starrocks::Aggregator::convert_to_chunk_no_groupby()
    @          0x2fb5690 starrocks::pipeline::AggregateBlockingSourceOperator::pull_chunk()
    @          0x2ca7d73 starrocks::pipeline::PipelineDriver::process()
    @          0x4ec2213 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x48c3b92 starrocks::ThreadPool::dispatch_thread()
    @          0x48be68a starrocks::Thread::supervise_thread()
    @     0x7f7a97ee9f1b (unknown)
    @     0x7f7a97c8c1a0 clone
    @                0x0 (unknown)
  1. Schema change keeps failing

The BE log contains entries like this:

get base tablet rowsets error tablet
  1. group_concat crash

query_id:e501e17d-68e1-11ee-9050-005056aadc5e, fragment_instance:e501e17d-68e1-11ee-9050-005056aadc65
*** Aborted at 1697102991 (unix time) try "date -d @1697102991" if you are using GNU date ***
PC: @          0x2cc9888 starrocks::vectorized::GroupConcatAggregateFunction<>::convert_to_serialize_format()
*** SIGSEGV (@0x2d0) received by PID 2437363 (TID 0x7f4e25ef5700) from PID 720; stack trace: ***
    @          0x3f973c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f4eacc12ce0 (unknown)
    @          0x2cc9888 starrocks::vectorized::GroupConcatAggregateFunction<>::convert_to_serialize_format()
    @          0x2d69266 starrocks::vectorized::NullableAggregateFunctionVariadic<>::convert_to_serialize_format()
    @          0x2a624d0 starrocks::Aggregator::output_chunk_by_streaming()
    @          0x29fd67f starrocks::pipeline::AggregateStreamingSinkOperator::_push_chunk_by_auto()
    @          0x2a0415d starrocks::pipeline::AggregateStreamingSinkOperator::push_chunk()
    @          0x29e45a7 starrocks::pipeline::PipelineDriver::process()
    @          0x29da46e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x22395b9 starrocks::ThreadPool::dispatch_thread()
    @          0x223516a starrocks::Thread::supervise_thread()
    @     0x7f4eacc081cf start_thread
    @     0x7f4eabc39d83 __GI___clone
    @                0x0 (unknown)