StarRocks 3.2.6 deployed with Helm on Kubernetes keeps crashing: it fails right after startup, and every error is a NullPointerException

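To help locate your problem faster, please provide the following information. Thank you.
[Details] Crashes continuously
[Background] What operations were performed? Restart
[Business impact] Yes, production environment
[Shared-data (storage-compute separation)?] Yes
[StarRocks version] 3.2.6
[Cluster size] 1 FE, 1 CN
[Machine info] Deployed on Kubernetes, 12 GB memory
[Contact] So that we can reach you promptly for logs while resolving the problem, please add your contact details, for example: community group 4, Xiao Li, or an email address. Thank you.
[Attachments]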

  • fe.log / be.INFO / relevant screenshots
    com.starrocks.journal.JournalInconsistentException: failed to load journal type 10242
    at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1179) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2369) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2318) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1312) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:346) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:815) ~[starrocks-fe.jar:?]
    at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
    at com.starrocks.common.util.Daemon.run(Daemon.java:107) ~[starrocks-fe.jar:?]
    Caused by: java.lang.NullPointerException
    at com.starrocks.server.LocalMetastore.replayAddPartition(LocalMetastore.java:1478) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayAddPartition(GlobalStateMgr.java:2562) ~[starrocks-fe.jar:?]
    at com.starrocks.persist.EditLog.loadJournal(EditLog.java:289) ~[starrocks-fe.jar:?]
    … 7 more
    2024-05-07 14:46:08.074+08:00 WARN (stateChangeExecutor|76) [GlobalStateMgr.replayJournal():2320] got interrupt exception or inconsistent exception when replay journal 2947929, will exit,
    com.starrocks.journal.JournalInconsistentException: failed to load journal type 10242
    at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1179) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2369) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2318) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1312) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:346) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:815) ~[starrocks-fe.jar:?]
    at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
    at com.starrocks.common.util.Daemon.run(Daemon.java:107) ~[starrocks-fe.jar:?]
    Caused by: java.lang.NullPointerException
    at com.starrocks.server.LocalMetastore.replayAddPartition(LocalMetastore.java:1478) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr.replayAddPartition(GlobalStateMgr.java:2562) ~[starrocks-fe.jar:?]
    at com.starrocks.persist.EditLog.loadJournal(EditLog.java:289) ~[starrocks-fe.jar:?]
    … 7 more

Which version is 3.6.2? There is no such version.

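Has this instance ever been upgraded or rolled back? We have seen a similar stack trace before. First upgrade to the latest patch release and check whether the problem persists. If it does, add metadata_journal_skip_bad_journal_ids = 2947929 to fe.conf and restart.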
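As a minimal sketch of where the ID to skip comes from: the failing journal ID appears in the WARN line that FE logs before it exits ("... when replay journal 2947929, will exit"). The helpers below (`find_bad_journal_ids` and `skip_config_line` are hypothetical names) pull such IDs out of log text and format the fe.conf entry. The regex is based only on the log line quoted in this thread, and joining several IDs with commas is an assumption, so check both against your own FE version.

```python
import re

# Pattern taken from the WARN line quoted in this thread; other FE versions
# may word the message differently (assumption).
_REPLAY_FAIL = re.compile(r"when replay journal (\d+), will exit")


def find_bad_journal_ids(log_text: str) -> list[int]:
    """Return the distinct journal IDs that FE reported as failing, in log order."""
    seen: list[int] = []
    for match in _REPLAY_FAIL.finditer(log_text):
        journal_id = int(match.group(1))
        if journal_id not in seen:
            seen.append(journal_id)
    return seen


def skip_config_line(journal_ids: list[int]) -> str:
    """Format the fe.conf entry; joining several IDs with commas is an assumption."""
    return "metadata_journal_skip_bad_journal_ids = " + ",".join(map(str, journal_ids))


sample = (
    "2024-05-07 14:46:08.074+08:00 WARN (stateChangeExecutor|76) "
    "[GlobalStateMgr.replayJournal():2320] got interrupt exception or "
    "inconsistent exception when replay journal 2947929, will exit,"
)
print(skip_config_line(find_bad_journal_ids(sample)))
```

Skipping a journal entry discards whatever that entry recorded, so this is a way to get FE started again rather than a fix for the underlying cause.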

It is a fresh deployment. Version: 3.2.6, shared-data (storage-compute separation).

metadata_journal_skip_bad_journal_ids = 2947929 had no effect.

Please post the output of sh bin/show_fe_version.sh.

Build version: 3.2.6
Commit hash: 2585333
Build type: RELEASE
Build time: 2024-04-18 13:27:22
Build distributor id: ubuntu
Build user: StarRocks@localhost (Ubuntu 22.04.3 LTS)
Java compile version: openjdk full version "11.0.20+8-post-Ubuntu-1ubuntu122.04"

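Following the recovery procedure for corrupted metadata did resolve it. The question is why the metadata became corrupted in the first place.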

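This is most likely a bug in the code: when a table drop and an add partition run concurrently, the edit log entries are written in the wrong order. I will ask the relevant colleagues to take a look. Could you send the fe.log from the period when the problem occurred? We need the logs for analysis.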
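To illustrate the suspected failure mode, here is a toy model (not StarRocks code): the catalog is a plain dict and the journal entries are hypothetical tuples. If the drop-table entry is persisted before an add-partition entry for the same table, replaying the log in order looks up a table that no longer exists. In this sketch that lookup returns None and the replay raises, which is loosely analogous to the NullPointerException in replayAddPartition.

```python
# Toy model only: names and structures here are hypothetical, not StarRocks internals.

def replay(journal):
    """Apply journal entries in order to an in-memory catalog."""
    catalog = {}  # table name -> list of partition names
    for op, table, arg in journal:
        if op == "create_table":
            catalog[table] = []
        elif op == "drop_table":
            catalog.pop(table, None)
        elif op == "add_partition":
            partitions = catalog.get(table)  # None if the table was already dropped
            partitions.append(arg)  # fails on None, like the NPE during replay
    return catalog


good_order = [
    ("create_table", "t1", None),
    ("add_partition", "t1", "p1"),
    ("drop_table", "t1", None),
]
bad_order = [
    ("create_table", "t1", None),
    ("drop_table", "t1", None),     # written first by the concurrent drop
    ("add_partition", "t1", "p1"),  # replay now targets a missing table
]

print(replay(good_order))
try:
    replay(bad_order)
except AttributeError as exc:
    print("replay failed:", exc)
```

Both orders describe the same operations; only the order in which they reach the log differs, which is why the problem shows up on replay at restart and not while the cluster is running.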

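Judging from the stack trace, this was introduced by the text-based MV rewrite feature added in 3.2. When restarting the FE, you can add
enable_materialized_view_text_based_rewrite = false and then restart.

By default, this has no impact on existing functionality.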

PS: This issue has already been fixed in later versions: https://github.com/StarRocks/starrocks/pull/45878/files

2024-05-07 14:16:43.945+08:00 ERROR (stateChangeExecutor|76) [MaterializedView.onReload():869] reload mv failed: Table [id=1404222, name=ads_instance_id_tag_keys_values_view, type=CLOUD_NATIVE_MATERIALIZED_VIEW]
java.lang.NullPointerException: null
        at com.starrocks.sql.analyzer.AstToSQLBuilder$AST2SQLBuilderVisitor.visitTableFunction(AstToSQLBuilder.java:306) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToSQLBuilder$AST2SQLBuilderVisitor.visitTableFunction(AstToSQLBuilder.java:67) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.TableFunctionRelation.accept(TableFunctionRelation.java:85) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:68) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:64) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToStringBuilder$AST2StringBuilderVisitor.visitJoin(AstToStringBuilder.java:614) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToStringBuilder$AST2StringBuilderVisitor.visitJoin(AstToStringBuilder.java:137) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.JoinRelation.accept(JoinRelation.java:134) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:68) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:64) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToSQLBuilder$AST2SQLBuilderVisitor.visitSelect(AstToSQLBuilder.java:194) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToSQLBuilder$AST2SQLBuilderVisitor.visitSelect(AstToSQLBuilder.java:67) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.SelectRelation.accept(SelectRelation.java:242) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:68) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:64) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToStringBuilder$AST2StringBuilderVisitor.visitQueryStatement(AstToStringBuilder.java:474) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.analyzer.AstToStringBuilder$AST2StringBuilderVisitor.visitQueryStatement(AstToStringBuilder.java:137) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.QueryStatement.accept(QueryStatement.java:56) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:68) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:64) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.CachingMvPlanContextBuilder$AstKey.<init>(CachingMvPlanContextBuilder.java:50) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.CachingMvPlanContextBuilder.putAstIfAbsent(CachingMvPlanContextBuilder.java:170) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.CachingMvPlanContextBuilder.invalidateFromCache(CachingMvPlanContextBuilder.java:138) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.setActive(MaterializedView.java:494) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.onReload(MaterializedView.java:866) ~[starrocks-fe.jar:?]
        at com.starrocks.server.LocalMetastore.replayCreateTable(LocalMetastore.java:2297) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayCreateTable(GlobalStateMgr.java:3118) ~[starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.loadJournal(EditLog.java:239) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2369) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2318) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1312) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:346) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:815) ~[starrocks-fe.jar:?]
        at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:107) ~[starrocks-fe.jar:?]
2024-05-07 14:16:43.946+08:00 WARN (stateChangeExecutor|76) [MaterializedView.setInactiveAndReason():500] set ads_instance_id_tag_keys_values_view to inactive because of reload failed: null
2024-05-07 14:16:43.946+08:00 ERROR (stateChangeExecutor|76) [LocalMetastore.replayCreateTable():2299] replay create table failed: Table [id=1404222, name=ads_instance_id_tag_keys_values_view, type=CLOUD_NATIVE_MATERIALIZED_VIEW]