Materialized view refresh makes the FE unavailable

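To help us locate your problem faster, please provide the following information. Thank you.
【Details】Detailed description of the problem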

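  1. The FE ran a materialized view refresh. The logs show that it first tried to drop the MV and then ran create materialized view; once that succeeded, it started buildPartitionsSequentially.
  2. The followers replayed the SinglePartitionPersistInfo journal entry at almost the same time, but the replay failed with an NPE.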

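The current suspicion is that these operations were persisted to the edit log in the wrong order, which made the journal replay on the FE fail. A similar issue: "Version 3.2.6, starrocks elm + kubernetes deployment keeps crashing; it fails immediately after startup, and every error is a null pointer".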
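
The suspected failure mode can be illustrated with a minimal sketch. This is not StarRocks code: `ReplayOrderSketch`, its methods, and the table ID `1L` are hypothetical names made up for illustration. It only assumes that replaying an add-partition entry looks the table up by ID without a null check, so any ordering in which the table is absent at replay time ends in a `NullPointerException`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a follower replaying edit-log entries.
// None of these names are StarRocks classes.
class ReplayOrderSketch {
    // A table only tracks the names of its partitions.
    static final class Table {
        final List<String> partitions = new ArrayList<>();
    }

    // In-memory catalog on the follower: tableId -> table.
    final Map<Long, Table> tables = new HashMap<>();

    void replayCreateTable(long tableId) {
        tables.put(tableId, new Table());
    }

    void replayDropTable(long tableId) {
        tables.remove(tableId);
    }

    // Assumed shape of the bug: the looked-up table is used without a null check.
    void replayAddPartition(long tableId, String partitionName) {
        Table table = tables.get(tableId);    // null if the table is not in the catalog
        table.partitions.add(partitionName);  // NullPointerException when table == null
    }

    // Returns true if replaying the entries in the given order hits an NPE.
    static boolean replayFails(String... ops) {
        ReplayOrderSketch follower = new ReplayOrderSketch();
        try {
            for (String op : ops) {
                switch (op) {
                    case "create": follower.replayCreateTable(1L); break;
                    case "drop": follower.replayDropTable(1L); break;
                    case "addPartition": follower.replayAddPartition(1L, "p1"); break;
                    default: throw new IllegalArgumentException(op);
                }
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Expected order: the table exists when the partition is added.
        System.out.println(replayFails("create", "addPartition"));
        // Misordered log: the drop lands between create and add-partition.
        System.out.println(replayFails("create", "drop", "addPartition"));
    }
}
```

In this toy model the first call succeeds and the second fails. Whether the real `LocalMetastore.replayAddPartition` fails for this reason or because some other object is null cannot be determined from the stack trace alone.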

Detailed timeline on the follower node:
07:50 load mv plan cache failed (the user had dropped the base table beforehand)

2025-02-28 08:15:18.383+08:00 WARN (thrift-server-pool-17462638|24131372) [CachingMvPlanContextBuilder.loadMvPlanContext():130] load mv plan cache failed: high_cost_single_mv_147391_full
com.starrocks.sql.analyzer.SemanticException: Getting analyzing error. Detail message: base-table dropped: mid_tr016606_dt20221128_repair_monitor_zh.
        at com.starrocks.common.MaterializedViewExceptions.reportBaseTableNotExists(MaterializedViewExceptions.java:72) ~[starrocks-fe.jar:?]
        at com.starrocks.server.MetadataMgr.getTableChecked(MetadataMgr.java:549) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.transformation.materialization.MvUtils.getTableChecked(MvUtils.java:1417) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.lambda$getBaseTableTypes$0(MaterializedView.java:605) ~[starrocks-fe.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
        at com.starrocks.catalog.MaterializedView.getBaseTableTypes(MaterializedView.java:605) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.MvPlanContextBuilder.getPlanContext(MvPlanContextBuilder.java:40) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.CachingMvPlanContextBuilder.loadMvPlanContext(CachingMvPlanContextBuilder.java:128) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.CachingMvPlanContextBuilder.getPlanContext(CachingMvPlanContextBuilder.java:112) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.MvRewritePreprocessor.isMVValidToRewriteQuery(MvRewritePreprocessor.java:531) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getQueryRewriteStatus(MaterializedView.java:1639) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ShowExecutor.listMaterializedViewStatus(ShowExecutor.java:2748) ~[starrocks-fe.jar:?]
        at com.starrocks.service.FrontendServiceImpl.listMaterializedViews(FrontendServiceImpl.java:792) ~[starrocks-fe.jar:?]
        at com.starrocks.service.FrontendServiceImpl.listMaterializedViewStatus(FrontendServiceImpl.java:702) ~[starrocks-fe.jar:?]
        at com.starrocks.service.FrontendServiceImpl.listMaterializedViewStatus(FrontendServiceImpl.java:585) ~[starrocks-fe.jar:?]
        at com.starrocks.thrift.FrontendService$Processor$listMaterializedViewStatus.getResult(FrontendService.java:4591) ~[starrocks-fe.jar:?]
        at com.starrocks.thrift.FrontendService$Processor$listMaterializedViewStatus.getResult(FrontendService.java:4571) ~[starrocks-fe.jar:?]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40) ~[libthrift-0.20.0.jar:0.20.0]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40) ~[libthrift-0.20.0.jar:0.20.0]
        at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]

08:58 materialized view refresh triggered
The audit log shows that the submitted statement was insert overwrite mv_tbl

> 2025-02-28 08:58:40.273 Create Materialized View command forwarded to the Master node for execution
> 2025-02-28 08:58:40.398 [OlapTable.rebuildFullSchema():714] after rebuild full schema. table high_cost_single_mv_147391_full, schema: xxx
> 2025-02-28 08:58:40.398+08:00 INFO (replayer|118) [EditLog.loadJournal():218] Begin to unprotect create materialized view. db = artnova_cache_db create materialized view = 114846700 tableName = high_cost_single_mv_147391_full
> 2025-02-28 08:58:40.409+08:00 INFO (replayer|118) [MaterializedView.setActive():534] set high_cost_single_mv_147391_full to active
> 2025-02-28 08:58:40.412+08:00 INFO (replayer|118) [CachingMvPlanContextBuilder.invalidateAstFromCache():161] Remove mv high_cost_single_mv_147391_full from ast cache
> 2025-02-28 08:58:40.413+08:00 INFO (replayer|118) [CachingMvPlanContextBuilder.putAstIfAbsent():182] Add mv high_cost_single_mv_147391_full input ast cache
> 2025-02-28 08:58:40.568+08:00 INFO (replayer|118) [GlobalStateMgr.replayJournalInner():1978] replayed journal from 18550498 - 18550500
> 2025-02-28 08:58:40.601+08:00 WARN (replayer|118) [GlobalStateMgr.replayJournalInner():1945] catch exception when replaying journal, id: 18550501, data: {"infos":[{"clazz":"SinglePartitionPersistInfo","dbId":13004,"tableId":114846700,"partition":{"id":114846710,"name":"high_cost_single_mv_147391_full_114846709","state":"NORMAL","idToSubPartition":{},"distributionInfo":{"clazz":"RandomDistributionInfo","b":1,"typeStr":"RANDOM","type":"RANDOM"},"shardGroupId":0,"isImmutable":false,"baseIndex":{"id":114846701,"state":"NORMAL","rowCount":0,"tablets":[{"clazz":"LocalTablet","replicas":[{"id":114846712,"backendId":94795810,"version":1,"minReadableVersion":0,"dataSize":0,"rowCount":0,"state":"NORMAL","lastFailedVersion":-1,"lastSuccessVersion":1},{"id":114846713,"backendId":24611538,"version":1,"minReadableVersion":0,"dataSize":0,"rowCount":0,"state":"NORMAL","lastFailedVersion":-1,"lastSuccessVersion":1},{"id":114846714,"backendId":12012,"version":1,"minReadableVersion":0,"dataSize":0,"rowCount":0,"state":"NORMAL","lastFailedVersion":-1,"lastSuccessVersion":1}],"checkedVersion":-1,"isConsistent":true,"id":114846711,"signature":-1,"lastCheckTime":-1}],"signature":-1,"lastCheckTime":-1},"idToVisibleRollupIndex":{},"idToShadowIndex":{},"visibleVersion":1,"visibleVersionTime":1740704320568,"nextVersion":2,"dataVersion":1,"nextDataVersion":2,"versionEpoch":341557946494222336,"versionTxnType":"TXN_NORMAL","signature":-1,"lastCheckTime":-1},"dataProperty":{"storageMedium":"HDD","cooldownTimeMs":253402271999000},"replicationNum":3,"isInMemory":false,"isTempPartition":true}]},
> **com.starrocks.journal.JournalInconsistentException: failed to load journal type 10242**
>         at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1208) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1934) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:1789) ~[starrocks-fe.jar:?]
>         at com.starrocks.common.util.Daemon.run(Daemon.java:107) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:1854) ~[starrocks-fe.jar:?]
> Caused by: java.lang.NullPointerException
>         at com.starrocks.server.LocalMetastore.replayAddPartition(LocalMetastore.java:1314) ~[starrocks-fe.jar:?]
>         at com.starrocks.persist.EditLog.loadJournal(EditLog.java:252) ~[starrocks-fe.jar:?]
>         ... 4 more
> 2025-02-28 08:58:40.602+08:00 WARN (replayer|118) [GlobalStateMgr$5.runOneCycle():1792] got interrupt exception or inconsistent exception when replay journal 18550501, will exit, 
> com.starrocks.journal.JournalInconsistentException: failed to load journal type 10242
>         at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1208) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:1934) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:1789) ~[starrocks-fe.jar:?]
>         at com.starrocks.common.util.Daemon.run(Daemon.java:107) ~[starrocks-fe.jar:?]
>         at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:1854) ~[starrocks-fe.jar:?]
> Caused by: java.lang.NullPointerException
>         at com.starrocks.server.LocalMetastore.replayAddPartition(LocalMetastore.java:1314) ~[starrocks-fe.jar:?]
>         at com.starrocks.persist.EditLog.loadJournal(EditLog.java:252) ~[starrocks-fe.jar:?]
>         ... 4 more
> 2025-02-28 08:58:40.604+08:00 INFO (Thread-67|135) [StarRocksFE.lambda$addShutdownHook$1():372] start to execute shutdown hook

The FE master node hit the same error while replaying the journal at 9:54 and crashed.

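【Background】What operations were performed?
The system refreshed the MV automatically
【Business impact】
The cluster is unavailable
【Shared-data (storage-compute separation) mode?】

【StarRocks version】
3.3.4
【Cluster size】
5 FEs and several dozen BEs
【Machine information】
【Contact】Community group - 王毅
【Attachments】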