FE process fails to start after the machine hosting the FE node is rebooted.

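The machine hosting the FE node failed. After rebooting the machine, I tried to start the FE process again, but it will not start. The fe.log output is as follows: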

2022-07-13 09:30:16,194 INFO (tablet scheduler|43) [TabletScheduler.adjustPriorities():382] adjust priority for all tablets. changed: 0, total: 0
2022-07-13 14:55:05,038 INFO (main|1) [StarRocksFE.start():104] StarRocks FE starting...
2022-07-13 14:55:05,063 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():130] configured prior_cidrs value: 192.168.0.162
2022-07-13 14:55:05,072 INFO (main|1) [FrontendOptions.init():98] local address: /192.168.0.162.
2022-07-13 14:55:05,926 INFO (main|1) [Catalog.getHelperNodes():1230] get helper nodes: [192.168.0.162:9010]
2022-07-13 14:55:05,969 INFO (main|1) [Catalog.getClusterIdAndRole():1104] finished to get cluster id: 1399819888, role: FOLLOWER and node name: 192.168.0.162_9010_1656389109189
2022-07-13 14:55:06,015 INFO (main|1) [Catalog.loadImage():1538] start load image from /data/starrocks/fe/meta/image/image.353464. is ckpt: false
2022-07-13 14:55:06,030 INFO (main|1) [Catalog.loadHeader():1682] finished replay header from image
2022-07-13 14:55:06,031 INFO (main|1) [Catalog.loadMasterInfo():1693] finished replay masterInfo from image
2022-07-13 14:55:06,234 INFO (main|1) [Catalog.loadDb():1736] finished replay databases from image
2022-07-13 14:55:06,245 INFO (main|1) [Catalog.loadLoadJob():1767] finished replay loadJob from image
2022-07-13 14:55:06,245 INFO (main|1) [Catalog.loadAlterJob():1799] finished replay alterJob from image
2022-07-13 14:55:06,246 INFO (main|1) [Catalog.loadRecycleBin():1956] finished replay recycleBin from image
2022-07-13 14:55:06,262 INFO (main|1) [Catalog.loadGlobalVariable():2267] finished replay globalVariable from image
2022-07-13 14:55:06,265 INFO (main|1) [Catalog.loadCluster():6777] finished replay cluster from image
2022-07-13 14:55:06,265 INFO (main|1) [Catalog.loadBrokers():6962] finished replay brokerMgr from image
2022-07-13 14:55:06,268 INFO (main|1) [Catalog.loadResources():1988] finished replay resources from image
2022-07-13 14:55:06,268 INFO (main|1) [Catalog.loadExportJob():1784] finished replay exportJob from image
2022-07-13 14:55:06,268 INFO (main|1) [Catalog.loadBackupHandler():1886] finished replay backupHandler from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadAuth():1928] finished replay auth from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadTransactionState():1937] finished replay transactionState from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadColocateTableIndex():1964] finished replay colocateTableIndex from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadRoutineLoadJobs():1972] finished replay routineLoadJobs from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadLoadJobsV2():1980] finished replay loadJobsV2 from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadSmallFiles():1996] finished replay smallFiles from image
2022-07-13 14:55:06,271 INFO (main|1) [Catalog.loadPlugins():7483] finished replay plugins from image
2022-07-13 14:55:06,430 INFO (main|1) [Catalog.loadDeleteHandler():1899] finished replay deleteHandler from image
2022-07-13 14:55:06,433 INFO (main|1) [Catalog.loadAnalyze():7519] finished replay analyze job from image
2022-07-13 14:55:06,437 INFO (main|1) [Catalog.loadWorkGroups():1906] finished replaying WorkGroups from image
2022-07-13 14:55:06,437 INFO (main|1) [Catalog.loadImage():1589] finished to load image in 422 ms
2022-07-13 14:55:08,678 INFO (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [BDBEnvironment.setup():188] add helper[192.168.0.162:9010] as ReplicationGroupAdmin
2022-07-13 14:55:08,685 WARN (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: UNKNOWN
2022-07-13 14:55:08,711 WARN (RepNode 192.168.0.162_9010_1656389109189(-1)|63) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: FOLLOWER
2022-07-13 14:55:08,722 WARN (REPLICA 192.168.0.162_9010_1656389109189(2)|63) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: UNKNOWN
2022-07-13 14:56:46,198 INFO (main|1) [StarRocksFE.start():104] StarRocks FE starting...
2022-07-13 14:56:46,204 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():130] configured prior_cidrs value: 192.168.0.162
2022-07-13 14:56:46,208 INFO (main|1) [FrontendOptions.init():98] local address: /192.168.0.162.
2022-07-13 14:56:46,584 INFO (main|1) [Catalog.getHelperNodes():1230] get helper nodes: [192.168.0.162:9010]
2022-07-13 14:56:46,604 INFO (main|1) [Catalog.getClusterIdAndRole():1104] finished to get cluster id: 1399819888, role: FOLLOWER and node name: 192.168.0.162_9010_1656389109189
2022-07-13 14:56:46,614 INFO (main|1) [Catalog.loadImage():1538] start load image from /data/starrocks/fe/meta/image/image.353464. is ckpt: false
2022-07-13 14:56:46,614 INFO (main|1) [Catalog.loadHeader():1682] finished replay header from image
2022-07-13 14:56:46,615 INFO (main|1) [Catalog.loadMasterInfo():1693] finished replay masterInfo from image
2022-07-13 14:56:46,752 INFO (main|1) [Catalog.loadDb():1736] finished replay databases from image
2022-07-13 14:56:46,762 INFO (main|1) [Catalog.loadLoadJob():1767] finished replay loadJob from image
2022-07-13 14:56:46,763 INFO (main|1) [Catalog.loadAlterJob():1799] finished replay alterJob from image
2022-07-13 14:56:46,764 INFO (main|1) [Catalog.loadRecycleBin():1956] finished replay recycleBin from image
2022-07-13 14:56:46,781 INFO (main|1) [Catalog.loadGlobalVariable():2267] finished replay globalVariable from image
2022-07-13 14:56:46,784 INFO (main|1) [Catalog.loadCluster():6777] finished replay cluster from image
2022-07-13 14:56:46,784 INFO (main|1) [Catalog.loadBrokers():6962] finished replay brokerMgr from image
2022-07-13 14:56:46,787 INFO (main|1) [Catalog.loadResources():1988] finished replay resources from image
2022-07-13 14:56:46,787 INFO (main|1) [Catalog.loadExportJob():1784] finished replay exportJob from image
2022-07-13 14:56:46,787 INFO (main|1) [Catalog.loadBackupHandler():1886] finished replay backupHandler from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadAuth():1928] finished replay auth from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadTransactionState():1937] finished replay transactionState from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadColocateTableIndex():1964] finished replay colocateTableIndex from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadRoutineLoadJobs():1972] finished replay routineLoadJobs from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadLoadJobsV2():1980] finished replay loadJobsV2 from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadSmallFiles():1996] finished replay smallFiles from image
2022-07-13 14:56:46,790 INFO (main|1) [Catalog.loadPlugins():7483] finished replay plugins from image
2022-07-13 14:56:46,867 INFO (main|1) [Catalog.loadDeleteHandler():1899] finished replay deleteHandler from image
2022-07-13 14:56:46,870 INFO (main|1) [Catalog.loadAnalyze():7519] finished replay analyze job from image
2022-07-13 14:56:46,874 INFO (main|1) [Catalog.loadWorkGroups():1906] finished replaying WorkGroups from image
2022-07-13 14:56:46,874 INFO (main|1) [Catalog.loadImage():1589] finished to load image in 260 ms
2022-07-13 14:56:47,321 INFO (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [BDBEnvironment.setup():188] add helper[192.168.0.162:9010] as ReplicationGroupAdmin
2022-07-13 14:56:47,326 WARN (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: UNKNOWN
2022-07-13 14:56:47,354 WARN (RepNode 192.168.0.162_9010_1656389109189(-1)|63) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: FOLLOWER
2022-07-13 14:56:47,369 WARN (REPLICA 192.168.0.162_9010_1656389109189(2)|63) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: UNKNOWN
2022-07-13 14:56:47,516 INFO (stateListener|76) [Catalog$4.runOneCycle():2419] begin to transfer FE type from INIT to UNKNOWN
2022-07-13 14:56:47,517 INFO (stateListener|76) [Catalog$4.runOneCycle():2505] finished to transfer FE type to UNKNOWN
2022-07-13 14:56:47,517 INFO (stateListener|76) [Catalog$4.runOneCycle():2419] begin to transfer FE type from UNKNOWN to FOLLOWER
2022-07-13 14:56:47,518 INFO (stateListener|76) [BDBHA.addHelperSocket():235] add 192.168.0.163:9010 to helper sockets
2022-07-13 14:56:47,519 INFO (stateListener|76) [BDBHA.addHelperSocket():235] add 192.168.0.161:9010 to helper sockets
2022-07-13 14:56:47,546 INFO (replayer|77) [Catalog.replayJournal():2522] replayed journal id is 353464, replay to journal id is 389264
2022-07-13 14:56:47,550 INFO (stateListener|76) [Catalog$4.runOneCycle():2505] finished to transfer FE type to FOLLOWER
2022-07-13 14:56:47,550 INFO (stateListener|76) [Catalog$4.runOneCycle():2419] begin to transfer FE type from FOLLOWER to UNKNOWN
2022-07-13 14:56:47,550 WARN (stateListener|76) [Catalog.transferToNonMaster():1403] FOLLOWER to UNKNOWN, still offer read service
2022-07-13 14:56:47,550 INFO (stateListener|76) [Catalog$4.runOneCycle():2505] finished to transfer FE type to UNKNOWN
2022-07-13 14:56:49,022 WARN (replayer|77) [Catalog.replayJournal():2550] replay journal cost too much time: 1473 replayedJournalId: 389264
2022-07-13 14:56:49,023 WARN (replayer|77) [Catalog.setCanRead():2367] meta out of date. current time: 1657695409023, synchronized time: 1657675726112, has log: true, fe type: UNKNOWN
2022-07-13 14:56:49,410 WARN (UNKNOWN 192.168.0.162_9010_1656389109189(2)|63) [Catalog.notifyNewFETypeTransfer():2396] notify new FE type transfer: FOLLOWER
2022-07-13 14:56:49,412 INFO (stateListener|76) [Catalog$4.runOneCycle():2419] begin to transfer FE type from UNKNOWN to FOLLOWER
2022-07-13 14:56:49,414 INFO (stateListener|76) [Catalog$4.runOneCycle():2505] finished to transfer FE type to FOLLOWER
2022-07-13 14:56:49,435 WARN (REPLICA 192.168.0.162_9010_1656389109189(2)|63) [BDBStateChangeListener.stateChange():61] this node is DETACHED
2022-07-13 14:56:49,456 INFO (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [Catalog.waitForReady():910] wait catalog to be ready. FE type: FOLLOWER. is ready: false
2022-07-13 14:56:51,456 INFO (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [Catalog.waitForReady():910] wait catalog to be ready. FE type: FOLLOWER. is ready: false
2022-07-13 14:56:53,457 INFO (UNKNOWN 192.168.0.162_9010_1656389109189(-1)|1) [Catalog.waitForReady():910] wait catalog to be ready. FE type: FOLLOWER. is ready: false
2022-07-13 14:56:54,026 WARN (replayer|77) [BDBEnvironment.getDatabaseNames():384] catch rollback exception, please restart
com.sleepycat.je.rep.RollbackException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 7.3.7) 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb Node 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb must rollback 3 total commits(1 of which were durable) to the earliest point indicated by transaction id=-596119 time=2022-07-13 09:28:42.692 vlsn=985,501 lsn=0x1e/0x3e993 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x30, offset 0x256359, vlsn 985,500 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2) Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2) Environment invalid because of previous exception: (JE 7.3.7) 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb Node 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb must rollback 3 total commits(1 of which were durable) to the earliest point indicated by transaction id=-596119 time=2022-07-13 09:28:42.692 vlsn=985,501 lsn=0x1e/0x3e993 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x30, offset 0x256359, vlsn 985,500 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2) Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2)
        at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:146) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:62) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Environment.checkOpen(Environment.java:2473) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Environment.getDatabaseNames(Environment.java:2245) ~[je-7.3.7.jar:7.3.7]
        at com.starrocks.journal.bdbje.BDBEnvironment.getDatabaseNames(BDBEnvironment.java:374) [starrocks-fe.jar:?]
        at com.starrocks.journal.bdbje.BDBJEJournal.getMaxJournalId(BDBJEJournal.java:213) [starrocks-fe.jar:?]
        at com.starrocks.persist.EditLog.getMaxJournalId(EditLog.java:99) [starrocks-fe.jar:?]
        at com.starrocks.catalog.Catalog.getMaxJournalId(Catalog.java:5528) [starrocks-fe.jar:?]
        at com.starrocks.catalog.Catalog.replayJournal(Catalog.java:2516) [starrocks-fe.jar:?]
        at com.starrocks.catalog.Catalog$3.runOneCycle(Catalog.java:2324) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:119) [starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 7.3.7) 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb Node 192.168.0.162_9010_1656389109189(2):/data/starrocks/fe/meta/bdb must rollback 3 total commits(1 of which were durable) to the earliest point indicated by transaction id=-596119 time=2022-07-13 09:28:42.692 vlsn=985,501 lsn=0x1e/0x3e993 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x30, offset 0x256359, vlsn 985,500 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2) Originally thrown by HA thread: REPLICA 192.168.0.162_9010_1656389109189(2)
        at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupHardRecovery(ReplicaFeederSyncup.java:680) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:372) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:157) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:711) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:474) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:409) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1873) ~[je-7.3.7.jar:7.3.7]

StarRocks version: 2.2.1

Hello. What operation led to the machine failure, and what exactly was the failure? How many FEs do you have in total? Is this the master node?

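Starting it one more time should be enough. The likely cause is that this node was the master at the time and had written some data locally that had not yet been synced to the other nodes, so when the node starts again it has to roll that data back.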
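To see how much the node has to roll back, the RollbackException message in the fe.log above can be parsed. The following is only a minimal sketch: the regular expression is based on the message wording seen in this log (JE 7.3.7) and may not match other JE versions, and `parse_rollback` is a hypothetical helper name, not part of StarRocks or BDB JE.

```python
import re

# Pattern derived from the RollbackException text in the fe.log above (JE 7.3.7).
# Assumption: other JE versions may word this message differently.
ROLLBACK_RE = re.compile(
    r"must rollback (?P<total>\d+) total commits"
    r"\((?P<durable>\d+) of which were durable\)"
    r".*?time=(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
    r" vlsn=(?P<vlsn>[\d,]+)"
)


def parse_rollback(line):
    """Return rollback details from one log line, or None if it does not match."""
    m = ROLLBACK_RE.search(line)
    if m is None:
        return None
    return {
        "total_commits": int(m.group("total")),
        "durable_commits": int(m.group("durable")),
        "rollback_to_time": m.group("time"),
        # The VLSN is printed with thousands separators, e.g. "985,501".
        "rollback_to_vlsn": int(m.group("vlsn").replace(",", "")),
    }


# Excerpt of the exception message from the log above.
sample = (
    "must rollback 3 total commits(1 of which were durable) to the earliest "
    "point indicated by transaction id=-596119 time=2022-07-13 09:28:42.692 "
    "vlsn=985,501 lsn=0x1e/0x3e993 durable=false in order to rejoin the "
    "replication group."
)
info = parse_rollback(sample)
print(info)
```

For the log in this thread, this reports 3 commits to roll back, 1 of them marked durable, back to the point at 2022-07-13 09:28:42.692.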


This is the master node, and there are 3 FEs in total. The machine had a disk failure; after rebooting the machine, the disk was remounted.

If that data is rolled back, and it was never synced to the other nodes either, doesn't that mean the data is lost?

How was this resolved in the end? I have run into the same problem on StarRocks 3.1.6.