Three FE nodes become unavailable one after another at irregular intervals

Version: 1.19.1

[Monitoring screenshot]

Description of the cluster failure:
The three FE nodes become unavailable one after another at irregular intervals.

The error log is as follows:
2022-06-18 17:58:51,104 WARN (REPLICA 10.207.8.52_9010_1637575084600(3)|63) [BDBStateChangeListener.stateChange():61] this node is DETACHED
2022-06-18 18:00:18,757 WARN (UNKNOWN 10.207.8.52_9010_1637575084600(-1)|1) [Catalog.notifyNewFETypeTransfer():2314] notify new FE type transfer: UNKNOWN
2022-06-18 18:00:18,785 WARN (RepNode 10.207.8.52_9010_1637575084600(-1)|63) [Catalog.notifyNewFETypeTransfer():2314] notify new FE type transfer: FOLLOWER
2022-06-18 18:01:14,717 WARN (replayer|78) [Catalog.replayJournal():2468] replay journal cost too much time: 55855 replayedJournalId: 264395853
2022-06-18 18:01:21,263 ERROR (UNKNOWN 10.207.8.52_9010_1637575084600(-1)|1) [QeService.<init>():48] Help module failed, because:
java.io.IOException: Can not find help zip file: help-resource.zip
at com.starrocks.qe.HelpModule.setUpModule(HelpModule.java:267) ~[starrocks-fe.jar:?]
at com.starrocks.qe.QeService.<init>(QeService.java:46) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:118) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:65) [starrocks-fe.jar:?]
2022-06-18 18:06:14,443 WARN (replayer|78) [BDBJournalCursor.next():128] fail to get journal 264490326, will try again. status: OperationStatus.NOTFOUND
2022-06-18 18:06:17,444 WARN (replayer|78) [Catalog.replayJournal():2468] replay journal cost too much time: 3001 replayedJournalId: 264490326
2022-06-18 18:10:34,054 WARN (replayer|78) [BDBJournalCursor.next():128] fail to get journal 264524197, will try again. status: OperationStatus.NOTFOUND
2022-06-18 18:10:37,055 WARN (replayer|78) [Catalog.replayJournal():2468] replay journal cost too much time: 3001 replayedJournalId: 264524197
2022-06-18 18:15:58,973 WARN (replayer|78) [BDBJournalCursor.next():148] Catch an exception when get next JournalEntity. key:264567868
com.sleepycat.je.LockTimeoutException: (JE 7.3.7) Lock expired. Locker 501460293 -1_replayer_ReplicaThreadLocker: waited for lock on database=264550001 LockAddr:168779928 LSN=0x42cf/0x3b5f65 type=READ grant=WAIT_NEW timeoutMillis=1000 startTime=1655547357968 endTime=1655547358968
Owners: [<LockInfo locker="270218369 -265566208_ReplayThread_ReplayTxn" type="WRITE"/>]
Waiters: []

    at com.sleepycat.je.txn.LockManager.makeTimeoutException(LockManager.java:1117) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.txn.LockManager.waitForLock(LockManager.java:606) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.txn.LockManager.lock(LockManager.java:345) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:124) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.rep.txn.ReplicaThreadLocker.lockInternal(ReplicaThreadLocker.java:63) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.txn.Locker.lock(Locker.java:499) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3585) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3316) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.dbi.CursorImpl.lockLNAndCheckDefunct(CursorImpl.java:2138) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.dbi.CursorImpl.searchExact(CursorImpl.java:1950) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Cursor.searchExact(Cursor.java:4194) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:4055) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Cursor.search(Cursor.java:3857) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Cursor.getInternal(Cursor.java:1284) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Database.get(Database.java:1271) ~[je-7.3.7.jar:7.3.7]
    at com.sleepycat.je.Database.get(Database.java:1330) ~[je-7.3.7.jar:7.3.7]
    at com.starrocks.journal.bdbje.CloseSafeDatabase.get(CloseSafeDatabase.java:47) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.next(BDBJournalCursor.java:108) [starrocks-fe.jar:?]
    at com.starrocks.catalog.Catalog.replayJournal(Catalog.java:2450) [starrocks-fe.jar:?]
    at com.starrocks.catalog.Catalog$3.runOneCycle(Catalog.java:2239) [starrocks-fe.jar:?]
    at com.starrocks.common.util.Daemon.run(Daemon.java:119) [starrocks-fe.jar:?]

2022-06-18 18:15:58,975 WARN (replayer|78) [Catalog.replayJournal():2468] replay journal cost too much time: 1007 replayedJournalId: 264567867
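
To see how often and how badly replay stalls, the "replay journal cost too much time" warnings can be pulled out of fe.log and listed together. The following is only a minimal sketch: the function name `slow_replays` is hypothetical, the regular expression is written against the line format shown above, and the cost is assumed to be in milliseconds (which matches the gaps between the timestamps in the log).

```python
import re

# Matches the WARN lines emitted by Catalog.replayJournal() in the log above.
# Assumption: the number after "cost too much time:" is in milliseconds.
REPLAY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN .*"
    r"replay journal cost too much time: (?P<cost_ms>\d+) "
    r"replayedJournalId: (?P<journal_id>\d+)"
)


def slow_replays(lines):
    """Return (timestamp, cost_ms, journal_id) for each slow-replay warning."""
    found = []
    for line in lines:
        match = REPLAY_RE.search(line)
        if match:
            found.append(
                (
                    match.group("ts"),
                    int(match.group("cost_ms")),
                    int(match.group("journal_id")),
                )
            )
    return found


if __name__ == "__main__":
    # Two lines copied from the log above; in practice, pass open("fe.log").
    sample = [
        "2022-06-18 18:01:14,717 WARN (replayer|78) [Catalog.replayJournal():2468] "
        "replay journal cost too much time: 55855 replayedJournalId: 264395853",
        "2022-06-18 18:06:17,444 WARN (replayer|78) [Catalog.replayJournal():2468] "
        "replay journal cost too much time: 3001 replayedJournalId: 264490326",
    ]
    for ts, cost_ms, journal_id in slow_replays(sample):
        print(ts, cost_ms, journal_id)
```

Sorting the result by cost makes the worst stalls (such as the 55855 ms one right after the node rejoined as FOLLOWER) easy to line up against the monitoring graphs.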

Question: what causes this kind of failure, and how should it be troubleshot?

Run show frontends and check the result, in particular the replayedJournalId field.
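
Once the show frontends output is available, the replayed journal IDs can be compared to see which FE is falling behind. This is a rough sketch, not an official tool: `journal_lag` is a hypothetical helper, the rows are assumed to be dicts keyed by the column names `Name` and `ReplayedJournalId` (check the exact column names in your version's output), and the sample values below are made up for illustration.

```python
def journal_lag(frontends):
    """Map each FE name to how many journal entries it trails the most advanced FE."""
    newest = max(int(fe["ReplayedJournalId"]) for fe in frontends)
    return {fe["Name"]: newest - int(fe["ReplayedJournalId"]) for fe in frontends}


if __name__ == "__main__":
    # Made-up rows standing in for the result of show frontends.
    rows = [
        {"Name": "fe_a", "ReplayedJournalId": "264567900"},
        {"Name": "fe_b", "ReplayedJournalId": "264567867"},
        {"Name": "fe_c", "ReplayedJournalId": "264567900"},
    ]
    for name, lag in journal_lag(rows).items():
        print(name, lag)
```

A follower whose lag keeps growing between runs is not keeping up with replay, which would be consistent with the "replay journal cost too much time" warnings in the log.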