fe节点宕机,重启后正常,过段时间还会宕机

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】fe节点宕机,重启后正常,过段时间还会宕机
【背景】定时的insert into select、streamload、TRUNCATE
【业务影响】影响数据
【是否存算分离】否
【StarRocks版本】2.5.22
【集群规模】例如:4fe(2 follower+1observer)+4be(fe与be混部)
【机器信息】CPU虚拟核/内存/网卡,例如:32C/64G/万兆
【联系方式】社区群10-A (1959533192@qq.com)
【附件】
2024-07-09 04:50:24,136 INFO (tablet scheduler|42) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 0.995920866808808, low/mid/high: 2/0/2
2024-07-09 04:50:30,111 INFO (consistency checker|28) [ConsistencyChecker.chooseTablets():364] choose tablets from db[information_schema-1](with read lock held) took 0ms
2024-07-09 04:50:30,111 INFO (consistency checker|28) [ConsistencyChecker.chooseTablets():364] choose tablets from db[data_zucc-11150](with read lock held) took 0ms
2024-07-09 04:50:30,111 INFO (consistency checker|28) [ConsistencyChecker.chooseTablets():364] choose tablets from db[data_zucc_sys-11151](with read lock held) took 0ms
2024-07-09 04:50:30,112 INFO (consistency checker|28) [ConsistencyChecker.chooseTablets():364] choose tablets from db[data_zucc_personal-11152](with read lock held) took 1ms
2024-07-09 04:50:30,112 INFO (consistency checker|28) [ConsistencyChecker.chooseTablets():364] choose tablets from db[statistics-10002](with read lock held) took 0ms
2024-07-09 04:50:32,289 ERROR (thrift-server-pool-48290|50450) [ProcessFunction.process():47] Internal error processing forward
com.sleepycat.je.EnvironmentFailureException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) 10.61.186.228_9010_1684911580924(2):/home/StarRocks/meta/bdb Latch timeout. com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[LogFlusher,5,main] currentTime: 1720471832224 exclusiveOwner: Thread[MASTER 10.61.186.228_9010_1684911580924(2),5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Environment.checkOpen(Environment.java:2473) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Database.checkEnv(Database.java:2287) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Database.closeInternalWork(Database.java:583) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Database.closeInternal(Database.java:513) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Database.close(Database.java:425) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBJEJournal.getMaxJournalId(BDBJEJournal.java:144) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.getMaxJournalId(GlobalStateMgr.java:2847) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.proxyExecute(ConnectProcessor.java:681) ~[starrocks-fe.jar:?]
at com.starrocks.service.FrontendServiceImpl.forward(FrontendServiceImpl.java:882) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:2651) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:2631) ~[starrocks-fe.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) 10.61.186.228_9010_1684911580924(2):/home/StarRocks/meta/bdb Latch timeout. com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[LogFlusher,5,main] currentTime: 1720471832224 exclusiveOwner: Thread[MASTER 10.61.186.228_9010_1684911580924(2),5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:443) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.latch.LatchSupport.handleTimeout(LatchSupport.java:211) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.latch.LatchWithStatsImpl.acquireExclusive(LatchWithStatsImpl.java:87) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogBufferPool.bumpCurrent(LogBufferPool.java:527) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogManager.flushInternal(LogManager.java:1260) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogManager.flushSync(LogManager.java:1210) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogFlusher$FlushTask.run(LogFlusher.java:230) ~[je-7.3.7.jar:7.3.7]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_121]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_121]
2024-07-09 04:50:58,005 ERROR (thrift-server-pool-48680|50842) [ProcessFunction.process():47] Internal error processing forward
com.sleepycat.je.EnvironmentFailureException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) 10.61.186.228_9010_1684911580924(2):/home/StarRocks/meta/bdb Latch timeout. com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[LogFlusher,5,main] currentTime: 1720471832224 exclusiveOwner: Thread[MASTER 10.61.186.228_9010_1684911580924(2),5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Environment.checkOpen(Environment.java:2473) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Environment.getDatabaseNames(Environment.java:2245) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBEnvironment.getDatabaseNamesWithPrefix(BDBEnvironment.java:459) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.getMaxJournalId(BDBJEJournal.java:133) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.getMaxJournalId(GlobalStateMgr.java:2847) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.proxyExecute(ConnectProcessor.java:681) ~[starrocks-fe.jar:?]
at com.starrocks.service.FrontendServiceImpl.forward(FrontendServiceImpl.java:882) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:2651) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:2631) ~[starrocks-fe.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) 10.61.186.228_9010_1684911580924(2):/home/StarRocks/meta/bdb Latch timeout. com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[LogFlusher,5,main] currentTime: 1720471832224 exclusiveOwner: Thread[MASTER 10.61.186.228_9010_1684911580924(2),5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:443) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.latch.LatchSupport.handleTimeout(LatchSupport.java:211) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.latch.LatchWithStatsImpl.acquireExclusive(LatchWithStatsImpl.java:87) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogBufferPool.bumpCurrent(LogBufferPool.java:527) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogManager.flushInternal(LogManager.java:1260) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogManager.flushSync(LogManager.java:1210) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.log.LogFlusher$FlushTask.run(LogFlusher.java:230) ~[je-7.3.7.jar:7.3.7]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_121]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_121]
2024-07-09 04:50:58,648 WARN (MASTER 10.61.186.228_9010_1684911580924(2)|64) [StateChangeExecutor.notifyNewFETypeTransfer():45] notify new FE type transfer: UNKNOWN
2024-07-09 04:50:58,650 INFO (stateChangeExecutor|74) [StateChangeExecutor.runOneCycle():68] begin to transfer FE type from LEADER to UNKNOWN
2024-07-09 04:50:58,650 ERROR (stateChangeExecutor|74) [StateChangeExecutor.runOneCycle():154] transfer FE type from LEADER to UNKNOWN. exit
2024-07-09 04:50:58,651 INFO (Thread-49|110) [StarRocksFE.lambda$addShutdownHook$1():354] start to execute shutdown hook
2024-07-09 04:50:58,657 WARN (UNKNOWN 10.61.186.228_9010_1684911580924(2)|64) [BDBStateChangeListener.stateChange():66] this node is DETACHED
2024-07-09 04:50:58,658 WARN (Thread-125|50843) [StarRocksFE.lambda$null$0():366] flush vlsn mapping failed

完整日志fe.warn.log (739.9 KB) 上传中:fe.log…

确实是2.5.22

是的,我之前看错了。从你贴出来的信息看是Leader FE在处理Forward时,BDB JE获取日志缓冲池(LogBufferPool)的锁(Latch)超时异常,导致FE从LEADER 转为 UNKNOWN,进而shutdown。我再分析一下日志看看能不能找到获取日志缓冲池的锁超时的原因。

您好,请问找到问题的原因了吗

2024-07-09 00:46:57,202 WARN (JournalWriter|165) [JournalWriter.updateBatchMetrics():213] slow edit log write, batch size: 1, took: 180876ms, current journal queue size: 2, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes

你监控一下FE上:/home/StarRocks/meta/bdb所在的硬盘的写入性能吧,看起来是IO存在性能瓶颈。


看那一项可以确定是io的性能瓶颈呀

如果硬盘存在性能问题,我会比较关心磁盘的利用率(%util)。


这个利用率吗,看着好像没啥问题


还是这个呀

不是这个,iostat -d -x命令里的%util。


这个吗

是的,你看看能不能写个脚本每30秒输出一次到一个文件里。等再出问题时,进行分析吧。

好的,我这边去搞一下


这次这次宕机前后的监控,我看%util没有变化呀

2024-07-12 13:10:28,136 INFO (tablet scheduler|42) [TabletScheduler.updateClusterLoadStatistic():479] update cluster load statistic:
be id: 11001, is available: true, mediums: [{medium: HDD, replica: 610, used: 7509445459, total: 371GB, score: 0.7244031014488379},{medium: SSD, replica: 0, used: 0, total: 0B, score: NaN},], paths: [{path: /home/StarRocks/storage, path hash: 1742208832857918068, be: 11001, medium: HDD, used: 7509445459, total: 398401350483},]
be id: 11010, is available: true, mediums: [{medium: HDD, replica: 630, used: 7634665910, total: 396.5GB, score: 0.7292142205142237},{medium: SSD, replica: 0, used: 0, total: 0B, score: NaN},], paths: [{path: /home/StarRocks/storage, path hash: 2955232950609066391, be: 11010, medium: HDD, used: 7634665910, total: 425773937078},]
be id: 11006, is available: true, mediums: [{medium: HDD, replica: 630, used: 26004797432, total: 395.5GB, score: 1.263264450478912},{medium: SSD, replica: 0, used: 0, total: 0B, score: NaN},], paths: [{path: /home/StarRocks/storage, path hash: -4845420638672259454, be: 11006, medium: HDD, used: 26004797432, total: 424685008888},]
be id: 11005, is available: true, mediums: [{medium: HDD, replica: 610, used: 26696196953, total: 395.4GB, score: 1.2673095742136868},{medium: SSD, replica: 0, used: 0, total: 0B, score: NaN},], paths: [{path: /home/StarRocks/storage, path hash: 5700248459953142804, be: 11005, medium: HDD, used: 26696196953, total: 424632775513},]

2024-07-12 13:10:34,308 INFO (colocate group clone checker|160) [ColocateTableBalancer.matchGroups():888] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2024-07-12 13:10:35,633 INFO (leaderCheckpointer|152) [BDBJEJournal.getFinalizedJournalId():259] database names: 77158915 77208915
2024-07-12 13:10:35,633 INFO (leaderCheckpointer|152) [Checkpoint.runAfterCatalogReady():82] checkpoint imageVersion 77208914, checkpointVersion 77208914
2024-07-12 13:10:41,395 INFO (ReportHandler|107) [ReportHandler.tabletReport():355] backend[11005] reports 632 tablet(s). report version: 17200177994425
2024-07-12 13:10:41,397 INFO (ReportHandler|107) [TabletInvertedIndex.tabletReport():289] finished to do tablet diff with backend[11005]. sync: 0. metaDel: 0. foundValid: 610. foundInvalid: 0. migration: 0. found invalid transactions 0. found republish transactions 0 cost: 1 ms
2024-07-12 13:10:41,398 INFO (ReportHandler|107) [ReportHandler.addDropReplicaTask():882] delete tablet[14722304] from backend[11005] because invalid meta, replica is enough[1-1]
2024-07-12 13:10:41,398 INFO (ReportHandler|107) [ReportHandler.deleteFromBackend():992] delete 1 tablet(s), add 0 replica(s) to meta, backend[11005]
2024-07-12 13:10:41,514 ERROR (JournalWriter|151) [BDBJEJournal.batchWriteCommit():404] failed to commit journal after retried 1 times! txn[] db[CloseSafeDatabase{db=77208915}]
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -92424391 VLSN: 159,290,572, initiated at: 13:10:31. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 10000ms. FeederState=10.61.186.228_9010_1684911580924(2)[MASTER]
Current feeds:
10.61.186.229_9010_1684912037496: feederVLSN=159,290,573 replicaTxnEndVLSN=159,290,570
10.61.186.227_9010_1720181205078: feederVLSN=159,290,573 replicaTxnEndVLSN=159,290,570

at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Transaction.commit(Transaction.java:337) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteCommit(BDBJEJournal.java:399) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:116) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:76) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]

2024-07-12 13:10:46,017 INFO (tablet checker|43) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2024-07-12 13:10:46,020 INFO (tablet checker|43) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: false, unhealthy/total/added/in_sched/not_ready: 0/2176/0/0/0, cost: 3 ms, in lock time: 3 ms, wait time: 0ms
2024-07-12 13:10:46,021 INFO (tablet checker|43) [TabletChecker.runAfterCatalogReady():202] TStat :
TStat num of tablet check round: 11692 (+1)
TStat cost of tablet check(ms): 1542434 (+3)
TStat num of tablet checked in tablet checker: 25441792 (+2176)
TStat num of unhealthy tablet checked in tablet checker: 484 (+0)
TStat num of tablet being added to tablet scheduler: 484 (+0)
TStat num of tablet schedule round: 233790 (+20)
TStat cost of tablet schedule(ms): 5845 (+0)
TStat num of tablet being scheduled: 3899 (+0)
TStat num of tablet being scheduled succeeded: 954 (+0)
TStat num of tablet being scheduled failed: 2773 (+0)
TStat num of tablet being scheduled discard: 5 (+0)
TStat num of tablet priority upgraded: 0 (+0)
TStat num of clone task: 481 (+0)
TStat num of clone task succeeded: 480 (+0)
TStat num of clone task failed: 0 (+0)
TStat num of clone task timeout: 0 (+0)
TStat num of replica missing error: 0 (+0)
TStat num of replica version missing error: 11 (+0)
TStat num of replica unavailable error: 0 (+0)
TStat num of replica redundant error: 1566 (+0)
TStat num of replica missing in cluster error: 0 (+0)
TStat num of balance scheduled: 2322 (+0)
TStat num of colocate replica mismatch: 0 (+0)
TStat num of colocate replica redundant: 0 (+0)
TStat num of colocate balancer running round: 0 (+0)

2024-07-12 13:10:46,516 WARN (JournalWriter|151) [BDBJEJournal.rebuildCurrentTransaction():424] transaction is invalid, rebuild the txn with 1 kvs
2024-07-12 13:10:46,516 INFO (JournalWriter|151) [BDBJEJournal.rebuildCurrentTransaction():443] rebuild txn succeed. new txn
2024-07-12 13:10:48,143 INFO (tablet scheduler|42) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 0.9960488320822042, low/mid/high: 2/0/2
2024-07-12 13:10:54,308 INFO (colocate group clone checker|160) [ColocateTableBalancer.matchGroups():888] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2024-07-12 13:10:56,518 ERROR (JournalWriter|151) [BDBJEJournal.batchWriteCommit():404] failed to commit journal after retried 2 times! txn[] db[CloseSafeDatabase{db=77208915}]
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -92424392 VLSN: 159,290,574, initiated at: 13:10:46. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 10000ms. FeederState=10.61.186.228_9010_1684911580924(2)[MASTER]
Current feeds:
10.61.186.229_9010_1684912037496: feederVLSN=159,290,575 replicaTxnEndVLSN=159,290,570
10.61.186.227_9010_1720181205078: feederVLSN=159,290,575 replicaTxnEndVLSN=159,290,570

at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Transaction.commit(Transaction.java:337) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteCommit(BDBJEJournal.java:399) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:116) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:76) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]

2024-07-12 13:11:01,519 WARN (JournalWriter|151) [BDBJEJournal.rebuildCurrentTransaction():424] transaction is invalid, rebuild the txn with 1 kvs
2024-07-12 13:11:01,519 INFO (JournalWriter|151) [BDBJEJournal.rebuildCurrentTransaction():443] rebuild txn succeed. new txn
2024-07-12 13:11:01,559 INFO (ReportHandler|107) [ReportHandler.tabletReport():355] backend[11001] reports 632 tablet(s). report version: 17200176034427
2024-07-12 13:11:01,562 INFO (ReportHandler|107) [TabletInvertedIndex.tabletReport():289] finished to do tablet diff with backend[11001]. sync: 0. metaDel: 0. foundValid: 610. foundInvalid: 0. migration: 0. found invalid transactions 0. found republish transactions 0 cost: 1 ms
2024-07-12 13:11:01,562 INFO (ReportHandler|107) [ReportHandler.addDropReplicaTask():882] delete tablet[14722324] from backend[11001] because invalid meta, replica is enough[1-1]
2024-07-12 13:11:01,562 INFO (ReportHandler|107) [ReportHandler.deleteFromBackend():992] delete 1 tablet(s), add 0 replica(s) to meta, backend[11001]
2024-07-12 13:11:06,021 INFO (tablet checker|43) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2024-07-12 13:11:06,025 INFO (tablet checker|43) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: false, unhealthy/total/added/in_sched/not_ready: 0/2176/0/0/0, cost: 3 ms, in lock time: 3 ms, wait time: 0ms
2024-07-12 13:11:06,025 INFO (tablet checker|43) [TabletChecker.runAfterCatalogReady():202] TStat :
TStat num of tablet check round: 11693 (+1)
TStat cost of tablet check(ms): 1542437 (+3)
TStat num of tablet checked in tablet checker: 25443968 (+2176)
TStat num of unhealthy tablet checked in tablet checker: 484 (+0)
TStat num of tablet being added to tablet scheduler: 484 (+0)
TStat num of tablet schedule round: 233810 (+20)
TStat cost of tablet schedule(ms): 5845 (+0)
TStat num of tablet being scheduled: 3899 (+0)
TStat num of tablet being scheduled succeeded: 954 (+0)
TStat num of tablet being scheduled failed: 2773 (+0)
TStat num of tablet being scheduled discard: 5 (+0)
TStat num of tablet priority upgraded: 0 (+0)
TStat num of clone task: 481 (+0)
TStat num of clone task succeeded: 480 (+0)
TStat num of clone task failed: 0 (+0)
TStat num of clone task timeout: 0 (+0)
TStat num of replica missing error: 0 (+0)
TStat num of replica version missing error: 11 (+0)
TStat num of replica unavailable error: 0 (+0)
TStat num of replica redundant error: 1566 (+0)
TStat num of replica missing in cluster error: 0 (+0)
TStat num of balance scheduled: 2322 (+0)
TStat num of colocate replica mismatch: 0 (+0)
TStat num of colocate replica redundant: 0 (+0)
TStat num of colocate balancer running round: 0 (+0)

2024-07-12 13:11:08,149 INFO (tablet scheduler|42) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 0.996033612876761, low/mid/high: 2/0/2
2024-07-12 13:11:11,521 ERROR (JournalWriter|151) [BDBJEJournal.batchWriteCommit():404] failed to commit journal after retried 3 times! txn[] db[CloseSafeDatabase{db=77208915}]
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -92424393 VLSN: 159,290,576, initiated at: 13:11:01. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 10000ms. FeederState=10.61.186.228_9010_1684911580924(2)[MASTER]
Current feeds:
10.61.186.229_9010_1684912037496: feederVLSN=159,290,577 replicaTxnEndVLSN=159,290,570
10.61.186.227_9010_1720181205078: feederVLSN=159,290,577 replicaTxnEndVLSN=159,290,570

at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Transaction.commit(Transaction.java:337) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteCommit(BDBJEJournal.java:399) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:116) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:76) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]

2024-07-12 13:11:11,521 WARN (JournalWriter|151) [JournalWriter.writeOneBatch():122] failed to commit batch, will abort current 1 journals.
com.starrocks.journal.JournalException: failed to commit journal after retried 3 times! txn[] db[CloseSafeDatabase{db=77208915}]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteCommit(BDBJEJournal.java:405) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:116) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:76) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -92424393 VLSN: 159,290,576, initiated at: 13:11:01. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 10000ms. FeederState=10.61.186.228_9010_1684911580924(2)[MASTER]
Current feeds:
10.61.186.229_9010_1684912037496: feederVLSN=159,290,577 replicaTxnEndVLSN=159,290,570
10.61.186.227_9010_1720181205078: feederVLSN=159,290,577 replicaTxnEndVLSN=159,290,570

at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.Transaction.commit(Transaction.java:337) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteCommit(BDBJEJournal.java:399) ~[starrocks-fe.jar:?]
... 3 more

2024-07-12 13:11:11,522 WARN (JournalWriter|151) [BDBJEJournal.batchWriteAbort():460] failed to abort transaction because no running transaction, will just ignore and return.
2024-07-12 13:11:11,522 ERROR (JournalWriter|151) [JournalWriter.abortJournalTask():163] failed to commit journal after retried 3 times! txn[] db[CloseSafeDatabase{db=77208915}]
2024-07-12 13:11:11,523 INFO (Thread-49|111) [StarRocksFE.lambda$addShutdownHook$1():354] start to execute shutdown hook
2024-07-12 13:11:11,533 INFO (Thread-49|111) [StarRocksFE.lambda$addShutdownHook$1():379] shutdown hook end

你好,有看出什么问题吗

2024-07-12 13:10:56,518 ERROR (JournalWriter|151) [BDBJEJournal.batchWriteCommit():404] failed to commit journal after retried 2 times! txn[] db[CloseSafeDatabase{db=77208915}]
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -92424392 VLSN: 159,290,574, initiated at: 13:10:46. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 10000ms. FeederState=10.61.186.228_9010_1684911580924(2)[MASTER]
Current feeds:
10.61.186.229_9010_1684912037496: feederVLSN=159,290,575 replicaTxnEndVLSN=159,290,570
10.61.186.227_9010_1720181205078: feederVLSN=159,290,575 replicaTxnEndVLSN=159,290,570

这里看BDB主节点(Master)在提交事务时,未能在预期的时间内从足够数量的副本节点(Replica)收到确认,导致事务尝试三次都失败了,触发FE shutdown了。
这有几个可能:
1、网络延迟导致副本不能在超时时间内向主节点发送确认。
2、副本节点性能差,处理事务的速度较慢。
3、复制负载过高。

除了排查网络、副本节点性能外,可以试试在fe/conf/fe.conf加一下下面的参数试试: bdbje_replica_ack_timeout_second = 30

好的我这边去调整一下,然后再去看下网络是不是有问题