com.starrocks.journal.JournalInconsistentException: failed to load journal type 118

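[Details] Detailed problem description
Creating a broker load job brought down the cluster's FE and BE nodes, and the service became completely unavailable.
[Background] What operations were performed?
Creating a broker load job caused the cluster's FE and BE nodes to go down.
[Business impact]
[StarRocks version] e.g. 3.0.0, shared-data (storage-compute separation) edition
[Cluster size] e.g. 3 FE (1 follower + 2 observers) + 6 BE (FE and BE deployed separately)
[Machine info] vCPUs/memory/NIC, e.g. 16C/64G/10 GbE
[Contact] So that we can reach you promptly for log information while resolving the issue, please add your contact details, e.g. community group 4, Xiao Li, or an email address. Thank you.
Community group 13, Sltily.w fantasticmao@gmail.com
[Attachments]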

2023-06-01 06:23:44,531 INFO (leaderCheckpointer|189) [EditLog.loadJournal():202] Begin to unprotect create table. db = flow table = 493354
2023-06-01 06:23:44,533 WARN (leaderCheckpointer|189) [GlobalStateMgr.replayJournalInner():2012] catch exception when replaying 201672,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 118
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:981) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2001) [starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1953) [starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:215) [starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:106) [starrocks-fe.jar:?]
at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:73) [starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
at com.starrocks.lake.StarOSAgent.getServiceId(StarOSAgent.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.prepare(StarOSAgent.java:94) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.getShardReplicas(StarOSAgent.java:393) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.getBackendIdsByShard(StarOSAgent.java:444) ~[starrocks-fe.jar:?]
at com.starrocks.lake.LakeTablet.getBackendIds(LakeTablet.java:88) ~[starrocks-fe.jar:?]
at com.starrocks.server.LocalMetastore.truncateTableInternal(LocalMetastore.java:4833) ~[starrocks-fe.jar:?]
at com.starrocks.server.LocalMetastore.replayTruncateTable(LocalMetastore.java:4862) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayTruncateTable(GlobalStateMgr.java:3520) ~[starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:574) ~[starrocks-fe.jar:?]
… 6 more
2023-06-01 06:23:44,533 WARN (leaderCheckpointer|189) [GlobalStateMgr.replayJournal():1955] got interrupt exception or inconsistent exception when replay journal 201672, will exit,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 118
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:981) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2001) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:1953) [starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:215) [starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:106) [starrocks-fe.jar:?]
at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:73) [starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
at com.starrocks.lake.StarOSAgent.getServiceId(StarOSAgent.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.prepare(StarOSAgent.java:94) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.getShardReplicas(StarOSAgent.java:393) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.getBackendIdsByShard(StarOSAgent.java:444) ~[starrocks-fe.jar:?]
at com.starrocks.lake.LakeTablet.getBackendIds(LakeTablet.java:88) ~[starrocks-fe.jar:?]
at com.starrocks.server.LocalMetastore.truncateTableInternal(LocalMetastore.java:4833) ~[starrocks-fe.jar:?]
at com.starrocks.server.LocalMetastore.replayTruncateTable(LocalMetastore.java:4862) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayTruncateTable(GlobalStateMgr.java:3520) ~[starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:574) ~[starrocks-fe.jar:?]
… 6 more
2023-06-01 06:23:44,535 INFO (Thread-52|114) [StarRocksFE.lambda$addShutdownHook$1():368] start to execute shutdown hook
2023-06-01 06:23:44,553 INFO (Thread-52|114) [StarRocksFE.lambda$addShutdownHook$1():393] shutdown hook end
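
When reading a trace like the one above, the outer JournalInconsistentException only says which journal type failed; the "Caused by:" section names the actual failure. The following is a minimal Python sketch for pulling that section out of pasted fe.log lines. The function name root_cause and the regular expression are illustrative assumptions, not part of StarRocks, and the sketch assumes the usual Java stack trace layout in which an "at ..." frame directly follows each "Caused by:" line.

```python
import re

# Matches a Java stack frame such as
# "at com.example.Foo.bar(Foo.java:10) ~[app.jar:?]" and captures the method.
FRAME_RE = re.compile(r"^\s*at\s+([\w.$]+)\(")


def root_cause(lines):
    """Return (exception, first_frame) for the last 'Caused by:' line, or None.

    With nested causes the last 'Caused by:' is the innermost one; with several
    traces in the same input it belongs to the last trace.
    """
    result = None
    for i, line in enumerate(lines):
        text = line.strip()
        if not text.startswith("Caused by:"):
            continue
        exception = text[len("Caused by:"):].strip()
        frame = None
        # Only trust the frame if it is the very next line.
        if i + 1 < len(lines):
            match = FRAME_RE.match(lines[i + 1])
            if match:
                frame = match.group(1)
        result = (exception, frame)
    return result


sample = """\
com.starrocks.journal.JournalInconsistentException: failed to load journal type 118
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:981) ~[starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
at com.starrocks.lake.StarOSAgent.getServiceId(StarOSAgent.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.lake.StarOSAgent.prepare(StarOSAgent.java:94) ~[starrocks-fe.jar:?]
""".splitlines()

print(root_cause(sample))
# ('java.lang.NullPointerException', 'com.starrocks.lake.StarOSAgent.getServiceId')
```

For this thread's log, the result points at StarOSAgent.getServiceId, reached while replaying a truncate table operation.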

This is a known issue. It has been fixed in the just-released version 3.0.1.

Is there any way to avoid this on 3.0.0? Or is upgrading the only way to resolve it right now?

Upgrading is the only fix. The corresponding PR is https://github.com/StarRocks/starrocks/pull/23507. Please upgrade to 3.0.1.


I upgraded to 3.0.1 and it still doesn't work.

The latest error message:

Run kubectl get pods to check whether all the FE pods started normally, and run show frontends; to check whether every FE is online and alive.

After I submit the broker load job, all the FEs go down and keep restarting.

Could you package the fe.out and fe.log files from all three machines and upload them here? We'll go through the logs and analyze the problem.

fe-0.tar.gz (20.7 MB) fe.out (6.0 MB) fe-1.tar.gz (26.0 MB) fe-2.tar.gz (32.6 MB)

Could you please take a look at what's going on?

When the cluster runs only ordinary queries and no broker load, does it work normally?

Yes, everything is fine without broker load. As soon as a broker load starts, the cluster goes down and cannot recover.

After the upgrade to 3.0.1, the FE edit log replay problem is resolved. The FE cluster now goes down mainly because of a jindosdk problem.

See the following fe.out log:

/lib/jvm/default-java/bin/java: symbol lookup error: /tmp/libjindosdk-180de20f20aab89c_20221124_025402.so: undefined symbol: _dl_sym, version GLIBC_PRIVATE

Broker load uses jindosdk to import the ORC data files from OSS. On Ubuntu machines, jindosdk fails to load libjindosdk.so, which makes the FE JVM exit immediately.
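
To check whether the other FE nodes hit the same native-library failure, fe.out can be scanned for lines of this shape. Below is a minimal Python sketch, assuming the message format matches the single line quoted above; the function name find_symbol_errors and the regular expression are illustrative and not part of StarRocks or jindosdk.

```python
import re

# Pattern modeled on the one fe.out line quoted in this thread:
# "<program>: symbol lookup error: <library>: undefined symbol: <symbol>, version <version>"
SYMBOL_ERROR_RE = re.compile(
    r"symbol lookup error: (?P<library>\S+): undefined symbol: (?P<symbol>[^,\s]+)"
    r"(?:, version (?P<version>\S+))?"
)


def find_symbol_errors(lines):
    """Return (library, symbol, version) for each symbol lookup error line.

    version is None when the line carries no ", version ..." suffix.
    """
    hits = []
    for line in lines:
        match = SYMBOL_ERROR_RE.search(line)
        if match:
            hits.append(
                (match.group("library"), match.group("symbol"), match.group("version"))
            )
    return hits


fe_out = [
    "/lib/jvm/default-java/bin/java: symbol lookup error: "
    "/tmp/libjindosdk-180de20f20aab89c_20221124_025402.so: "
    "undefined symbol: _dl_sym, version GLIBC_PRIVATE",
]

for library, symbol, version in find_symbol_errors(fe_out):
    print(library, symbol, version)
```

In practice the list would come from reading each node's fe.out file. A hit whose library path contains libjindosdk matches the failure described here.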

With the AWS S3 SDK, the data imports normally.
We recommend following the documentation: https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-manipulation/BROKER%20LOAD#other-s3-compatible-storage-system

Switch to the s3:// scheme for the import.


My environment:
[root@yry-dev-data-1 bin]# ./show_fe_version.sh
Commit hash: f8ff06d
Build type: RELEASE
Build time: 2023-07-18 08:19:29
Build user: StarRocks@localhost
Java compile version: openjdk full version "1.8.0_362-b08"

/opt/StarRocks-3.0.4/fe/

After upgrading to 3.0.5, it reports the same error:

2023-09-06 09:22:12,443 WARN (stateChangeExecutor|73) [GlobalStateMgr.replayJournalInner():2163] catch exception when replaying 27714988,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 100
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1060) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2152) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2104) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1143) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:325) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:722) ~[starrocks-fe.jar:?]
at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
at com.starrocks.statistic.AnalyzeMgr.updateLoadRows(AnalyzeMgr.java:493) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.DatabaseTransactionMgr.updateCatalogAfterVisible(DatabaseTransactionMgr.java:1553) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.DatabaseTransactionMgr.replayUpsertTransactionState(DatabaseTransactionMgr.java:1641) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.GlobalTransactionMgr.replayUpsertTransactionState(GlobalTransactionMgr.java:639) ~[starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:600) ~[starrocks-fe.jar:?]
… 7 more
2023-09-06 09:22:12,452 WARN (stateChangeExecutor|73) [GlobalStateMgr.replayJournal():2106] got interrupt exception or inconsistent exception when replay journal 27714988, will exit,
com.starrocks.journal.JournalInconsistentException: failed to load journal type 100
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:1060) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournalInner(GlobalStateMgr.java:2152) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.replayJournal(GlobalStateMgr.java:2104) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.transferToLeader(GlobalStateMgr.java:1143) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.access$100(GlobalStateMgr.java:325) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr$1.transferToLeader(GlobalStateMgr.java:722) ~[starrocks-fe.jar:?]
at com.starrocks.ha.StateChangeExecutor.runOneCycle(StateChangeExecutor.java:103) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: java.lang.NullPointerException
at com.starrocks.statistic.AnalyzeMgr.updateLoadRows(AnalyzeMgr.java:493) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.DatabaseTransactionMgr.updateCatalogAfterVisible(DatabaseTransactionMgr.java:1553) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.DatabaseTransactionMgr.replayUpsertTransactionState(DatabaseTransactionMgr.java:1641) ~[starrocks-fe.jar:?]
at com.starrocks.transaction.GlobalTransactionMgr.replayUpsertTransactionState(GlobalTransactionMgr.java:639) ~[starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.loadJournal(EditLog.java:600) ~[starrocks-fe.jar:?]
… 7 more