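FE node suddenly went down

[Details] The leader node went down, and then one of the followers went down as well.
[Background]
[Business impact]
[StarRocks version] 3.0.2
[Cluster size] 3 FE (3 followers: 1 on a dedicated host, 2 co-located with BE) + 12 BE (2 co-located with FE)
[Machine specs] 8 cores, 32 GB RAM
[Contact] 18840044216
[Attachments]
fe.log: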

2023-09-15 07:42:33,747 INFO (starrocks-mysql-nio-pool-13887|23350875) [QeProcessorImpl.registerQuery():95] register query id = 6091a8d7-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,749 INFO (starrocks-mysql-nio-pool-13889|23350880) [QeProcessorImpl.registerQuery():95] register query id = 6091f6f8-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,755 INFO (leaderCheckpointer|87) [BDBJEJournal.getFinalizedJournalId():275] database names: 156058258
2023-09-15 07:42:33,755 INFO (leaderCheckpointer|87) [Checkpoint.runAfterCatalogReady():95] checkpoint imageVersion 156058257, checkPointVersion 0
2023-09-15 07:42:33,765 INFO (starrocks-mysql-nio-pool-13887|23350875) [QeProcessorImpl.unregisterQuery():105] deregister query id 6091a8d7-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,766 INFO (nioEventLoopGroup-7-2|336) [RestBaseAction.handleRequest():70] receive http request. url=/api/starrocks_audit_db__/starrocks_audit_tbl__/stream_load?
2023-09-15 07:42:33,766 WARN (nioEventLoopGroup-7-2|336) [RestBaseAction.handleRequest():75] fail to process url: /api/starrocks_audit_db__/starrocks_audit_tbl__/stream_load?
com.starrocks.http.UnauthorizedException: Access denied for root@172.16.4.127
at com.starrocks.http.BaseAction.checkPassword(BaseAction.java:364) ~[starrocks-fe.jar:?]
at com.starrocks.http.rest.RestBaseAction.execute(RestBaseAction.java:90) ~[starrocks-fe.jar:?]
at com.starrocks.http.rest.RestBaseAction.handleRequest(RestBaseAction.java:73) ~[starrocks-fe.jar:?]
at com.starrocks.http.HttpServerHandler.channelRead(HttpServerHandler.java:93) ~[starrocks-fe.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-all-4.1.61.Final.jar:4.1.61.Final]
at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[netty-all-4.1.61.Final.jar:4.1.61.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-all-4.1.61.Final.jar:4.1.61.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[netty-all-4.1.61.Final.jar:4.1.61.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[netty-all-4.1.61.Final.jar:4.1.61.Final]
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
2023-09-15 07:42:33,767 INFO (starrocks-mysql-nio-pool-13889|23350880) [QeProcessorImpl.unregisterQuery():105] deregister query id 6091f6f8-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,770 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.registerQuery():95] register query id = 609181c6-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,770 WARN (audit loader thread|72) [StarrocksStreamLoader.loadBatch():151] failed to load audit via AuditLoader plugin with label: audit_20230915_074233_172_16_4_127_9010
java.lang.Exception: status is not TEMPORARY_REDIRECT 307, status: 401, response: {"status":"FAILED","msg":"Access denied for root@172.16.4.127"}, request is: curl -v -X PUT
-H "Authorization":"Basic cm9vdDo="
-H "Expect":"100-continue"
-H "Content-Type":"text/plain; charset=UTF-8"
-H "max_filter_ratio":"1.0"
-H "columns":"queryId, timestamp, queryType, clientIp, user, authorizedUser, resourceGroup, catalog, db, state, errorCode,queryTime, scanBytes, scanRows, returnRows, cpuCostNs, memCostBytes, stmtId, isQuery, feIp, stmt, digest, planCpuCosts, planMemCosts"
"http://172.16.4.127:8030/api/starrocks_audit_db__/starrocks_audit_tbl__/_stream_load?"
at com.starrocks.plugin.audit.StarrocksStreamLoader.loadBatch(StarrocksStreamLoader.java:125) ~[?:?]
at com.starrocks.plugin.audit.AuditLoaderPlugin.loadIfNecessary(AuditLoaderPlugin.java:197) ~[?:?]
at com.starrocks.plugin.audit.AuditLoaderPlugin.access$300(AuditLoaderPlugin.java:47) ~[?:?]
at com.starrocks.plugin.audit.AuditLoaderPlugin$LoadWorker.run(AuditLoaderPlugin.java:286) ~[?:?]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
2023-09-15 07:42:33,778 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.unregisterQuery():105] deregister query id 609181c6-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,781 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.registerQuery():95] register query id = 60970011-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,783 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.unregisterQuery():105] deregister query id 60970011-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,786 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.registerQuery():95] register query id = 6097ea72-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:33,810 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.unregisterQuery():105] deregister query id 6097ea72-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:55,305 INFO (colocate group clone checker|95) [ColocateTableBalancer.matchGroups():894] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2023-09-15 07:42:55,312 INFO (starrocks-mysql-nio I/O-4|147) [AcceptListener.handleEvent():71] Connection established. remote=/10.0.1.211:40564
2023-09-15 07:42:55,321 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.registerQuery():95] register query id = 6d6d9748-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:55,327 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.unregisterQuery():105] deregister query id 6d6d9748-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:55,331 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.registerQuery():95] register query id = 6d6f44f9-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:42:55,334 INFO (starrocks-mysql-nio-pool-13891|23350882) [QeProcessorImpl.unregisterQuery():105] deregister query id 6d6f44f9-5358-11ee-9fef-00163e09a0ba
2023-09-15 07:43:31,727 INFO (colocate group clone checker|95) [ColocateTableBalancer.matchGroups():894] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2023-09-15 07:43:31,727 INFO (nioEventLoopGroup-7-3|716) [RestBaseAction.handleRequest():70] receive http request. url=/api/bootstrap?cluster_id=1531098948&token=85a39bb5-d5cd-479a-9e5f-7a6676a88eae
2023-09-15 07:43:31,727 ERROR (JournalWriter|86) [BDBJEJournal.batchWriteBegin():322] failed to begin txn after retried 3 times! db = CloseSafeDatabase{db=156058258}
com.sleepycat.je.rep.InsufficientReplicasException: (JE 18.3.13) Commit policy: SIMPLE_MAJORITY required 1 replica. But none were active with this master.
at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureReplicasForCommit(DurabilityQuorum.java:116) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.RepImpl.txnBeginHook(RepImpl.java:1171) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.txnBeginHook(MasterTxn.java:195) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.initTxn(Txn.java:384) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.<init>(Txn.java:288) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.<init>(Txn.java:267) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.<init>(MasterTxn.java:146) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn$1.create(MasterTxn.java:117) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.create(MasterTxn.java:435) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.RepImpl.createRepUserTxn(RepImpl.java:1145) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.createUserTxn(Txn.java:315) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.TxnManager.txnBegin(TxnManager.java:199) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.dbi.EnvironmentImpl.txnBegin(EnvironmentImpl.java:2540) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.beginTransactionInternal(Environment.java:1498) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.beginTransaction(Environment.java:1383) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteBegin(BDBJEJournal.java:316) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:107) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:87) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
2023-09-15 07:43:31,727 WARN (JournalWriter|86) [JournalWriter.writeOneBatch():122] failed to write batch, will abort current journal com.starrocks.journal.JournalTask@4a5443e0 and commit
com.starrocks.journal.JournalException: failed to begin txn after retried 3 times! db = CloseSafeDatabase{db=156058258}
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteBegin(BDBJEJournal.java:323) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter.writeOneBatch(JournalWriter.java:107) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalWriter$1.runOneCycle(JournalWriter.java:87) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.InsufficientReplicasException: (JE 18.3.13) Commit policy: SIMPLE_MAJORITY required 1 replica. But none were active with this master.
at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureReplicasForCommit(DurabilityQuorum.java:116) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.RepImpl.txnBeginHook(RepImpl.java:1171) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.txnBeginHook(MasterTxn.java:195) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.initTxn(Txn.java:384) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.<init>(Txn.java:288) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.<init>(Txn.java:267) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.<init>(MasterTxn.java:146) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn$1.create(MasterTxn.java:117) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.txn.MasterTxn.create(MasterTxn.java:435) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.RepImpl.createRepUserTxn(RepImpl.java:1145) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.Txn.createUserTxn(Txn.java:315) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.txn.TxnManager.txnBegin(TxnManager.java:199) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.dbi.EnvironmentImpl.txnBegin(EnvironmentImpl.java:2540) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.beginTransactionInternal(Environment.java:1498) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.beginTransaction(Environment.java:1383) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.batchWriteBegin(BDBJEJournal.java:316) ~[starrocks-fe.jar:?]
... 3 more
2023-09-15 07:43:31,728 ERROR (JournalWriter|86) [JournalWriter.abortJournalTask():176] failed to begin txn after retried 3 times! db = CloseSafeDatabase{db=156058258}
2023-09-15 07:43:31,728 INFO (Thread-85|158) [StarRocksFE.lambda$addShutdownHook$1():368] start to execute shutdown hook
2023-09-15 07:43:31,732 WARN (nioEventLoopGroup-7-1|241) [HttpServerHandler.exceptionCaught():119] java.io.IOException: Connection reset by peer
2023-09-15 07:43:31,733 INFO (Thread-85|158) [StarRocksFE.lambda$addShutdownHook$1():393] shutdown hook end
2023-09-15 09:27:31,839 INFO (main|1) [StarRocksFE.start():124] StarRocks FE starting, version: 3.0.2-c833698
2023-09-15 09:27:31,845 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():299] configured prior_cidrs value: 172.16.4.127
2023-09-15 09:27:31,848 INFO (main|1) [FrontendOptions.initAddrUseIp():249] Use IP init local addr, IP: /172.16.4.127
2023-09-15 09:27:32,134 INFO (main|1) [Auth.grantRoleInternal():831] grant operator to 'root'@'%', isReplay = true
2023-09-15 09:27:32,162 INFO (main|1) [AuthorizationManager.initBuiltinRoleUnlocked():286] create built-in role root[-1]
2023-09-15 09:27:32,168 INFO (main|1) [AuthorizationManager.initBuiltinRoleUnlocked():286] create built-in role db_admin[-2]
2023-09-15 09:27:32,168 INFO (main|1) [AuthorizationManager.initBuiltinRoleUnlocked():286] create built-in role cluster_admin[-3]
2023-09-15 09:27:32,169 INFO (main|1) [AuthorizationManager.initBuiltinRoleUnlocked():286] create built-in role user_admin[-4]
2023-09-15 09:27:32,169 INFO (main|1) [AuthorizationManager.initBuiltinRoleUnlocked():286] create built-in role public[-5]
2023-09-15 09:27:32,169 INFO (main|1) [GlobalStateMgr.initAuth():1015] using new privilege framework…
2023-09-15 09:27:32,392 INFO (main|1) [NodeMgr.getHelperNodes():645] get helper nodes: [172.16.4.127:9010]
2023-09-15 09:27:32,409 INFO (main|1) [NodeMgr.getClusterIdAndRoleOnStartup():438] Current run_mode is null
2023-09-15 09:27:32,409 INFO (main|1) [NodeMgr.getClusterIdAndRoleOnStartup():445] Got cluster id: 1531098948, role: FOLLOWER, node name: 172.16.4.127_9010_1672906882230 and run_mode: null
2023-09-15 09:27:32,410 INFO (main|1) [BDBEnvironment.ensureHelperInLocal():340] skip check local environment because helper node and local node are identical.
2023-09-15 09:27:32,448 INFO (main|1) [BDBEnvironment.setupEnvironment():269] start to setup bdb environment for 1 times
2023-09-15 09:27:33,226 WARN (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [StateChangeExecutor.notifyNewFETypeTransfer():62] notify new FE type transfer: UNKNOWN
2023-09-15 09:27:33,250 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():280] replicated environment is all set, wait for state change…
2023-09-15 09:27:33,350 WARN (RepNode 172.16.4.127_9010_1672906882230(-1)|56) [StateChangeExecutor.notifyNewFETypeTransfer():62] notify new FE type transfer: FOLLOWER
2023-09-15 09:27:33,363 WARN (REPLICA 172.16.4.127_9010_1672906882230(1)|56) [StateChangeExecutor.notifyNewFETypeTransfer():62] notify new FE type transfer: UNKNOWN
2023-09-15 09:27:33,376 WARN (UNKNOWN 172.16.4.127_9010_1672906882230(1)|56) [StateChangeExecutor.notifyNewFETypeTransfer():62] notify new FE type transfer: FOLLOWER
2023-09-15 09:27:33,386 WARN (REPLICA 172.16.4.127_9010_1672906882230(1)|56) [BDBStateChangeListener.stateChange():79] this node is DETACHED
2023-09-15 09:27:34,251 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():288] state change done, current role FOLLOWER
2023-09-15 09:27:34,252 ERROR (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():305] failed to setup environment after retried 1 times
com.sleepycat.je.rep.RollbackException: (JE 18.3.13) Environment must be closed, caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.13) 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb Node 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb must rollback 1 total commits(1 of which were durable) to the earliest point indicated by transaction id=-160943999 time=2023-09-15 07:42:03.489 vlsn=312,047,109 lsn=0x41b9/0x930376 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated. Log files were truncated to file 0x16825, offset 0x9634631, vlsn 312,047,108 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1) Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1) Environment invalid because of previous exception: (JE 18.3.13) 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb Node 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb must rollback 1 total commits(1 of which were durable) to the earliest point indicated by transaction id=-160943999 time=2023-09-15 07:42:03.489 vlsn=312,047,109 lsn=0x41b9/0x930376 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated. Log files were truncated to file 0x16825, offset 0x9634631, vlsn 312,047,108 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1) Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1)
at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:146) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:62) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.checkOpen(Environment.java:2697) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.Environment.openDatabase(Environment.java:659) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.setupEnvironment(BDBEnvironment.java:291) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:175) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.initBDBEnvironment(BDBEnvironment.java:153) ~[starrocks-fe.jar:?]
at com.starrocks.journal.JournalFactory.create(JournalFactory.java:31) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.initJournal(GlobalStateMgr.java:1039) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.initialize(GlobalStateMgr.java:988) ~[starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:130) ~[starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:82) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.13) 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb Node 172.16.4.127_9010_1672906882230(1):/opt/module/meta/bdb must rollback 1 total commits(1 of which were durable) to the earliest point indicated by transaction id=-160943999 time=2023-09-15 07:42:03.489 vlsn=312,047,109 lsn=0x41b9/0x930376 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated. Log files were truncated to file 0x16825, offset 0x9634631, vlsn 312,047,108 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1) Originally thrown by HA thread: REPLICA 172.16.4.127_9010_1672906882230(1)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupHardRecovery(ReplicaFeederSyncup.java:721) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:417) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:164) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:732) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:485) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:412) ~[starrocks-bdb-je-18.3.13.jar:?]
at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1869) ~[starrocks-bdb-je-18.3.13.jar:?]
2023-09-15 09:27:34,256 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.close():517] start to close epoch database
2023-09-15 09:27:34,256 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.close():526] close epoch database end
2023-09-15 09:27:34,256 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.close():528] start to close replicated environment
2023-09-15 09:27:34,257 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.close():538] close replicated environment end
2023-09-15 09:27:39,258 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():269] start to setup bdb environment for 2 times
2023-09-15 09:27:39,618 WARN (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [StateChangeExecutor.notifyNewFETypeTransfer():62] notify new FE type transfer: UNKNOWN
2023-09-15 09:27:39,619 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():280] replicated environment is all set, wait for state change…
2023-09-15 09:27:49,621 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():288] state change done, current role UNKNOWN
2023-09-15 09:27:49,628 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [BDBEnvironment.setupEnvironment():292] end setup bdb environment after 2 times
2023-09-15 09:27:49,634 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [GlobalStateMgr.loadImage():1337] start load image from /opt/module/meta/image/image.156058257. is ckpt: false
2023-09-15 09:27:49,635 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [GlobalStateMgr.loadHeader():1534] finished replay header from image
2023-09-15 09:27:49,636 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [NodeMgr.loadLeaderInfo():1203] finished replay masterInfo from image
2023-09-15 09:27:56,578 INFO (UNKNOWN 172.16.4.127_9010_1672906882230(-1)|1) [LocalMetastore.loadDb():356] finished replay databases from image

Was any operation performed before it went down? Was the node that went down the leader? Is this a newly deployed cluster?

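The leader went down first, at around 07:43. Some scheduled SQL jobs were probably running at the time. We used to run a single FE node and only recently added 2 followers for high availability.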

com.sleepycat.je.rep.InsufficientReplicasException: (JE 7.3.7) Commit policy: SIMPLE_MAJORITY required 1 replica. But none were active with this master.
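The "required 1 replica" in this message matches the 3-follower setup described above. Below is a minimal sketch of the arithmetic, assuming the usual definition of a simple majority (more than half of the electable nodes, with the master counting as one of them). The function name `required_replica_acks` is made up for illustration and is not part of BDB JE or StarRocks.

```python
def required_replica_acks(electable_nodes: int) -> int:
    """Replica acknowledgements a master needs under a simple-majority policy.

    A majority of n electable nodes is n // 2 + 1. The master itself is one
    of those nodes, so it still needs n // 2 acknowledgements from replicas.
    """
    if electable_nodes < 1:
        raise ValueError("need at least one electable node")
    majority = electable_nodes // 2 + 1
    return majority - 1  # subtract the master's own vote


# Print the requirement for a few group sizes (3 is the case in this thread).
for n in (1, 3, 5):
    print(f"{n} follower FE(s): master needs {required_replica_acks(n)} replica ack(s)")
```

With 3 followers the master needs one other follower to be reachable. If both of the others are unresponsive at the same moment, the master cannot begin a transaction, which is what the `JournalWriter` error in the log reports.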

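This is usually a memory problem: either the Leader node or a Follower node was using too much memory and went into a Full GC. Increasing the JVM Xmx setting can mitigate it. Monitor the memory usage of the FE nodes to see whether it keeps growing; if it keeps rising, capture a pstack.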
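One rough way to look for such pauses is to scan fe.log for long silences between consecutive timestamped lines. The sketch below assumes the timestamp layout seen in the log above (`YYYY-MM-DD HH:MM:SS,mmm` in the first 23 characters); `find_gaps` and the 10-second threshold are illustrative choices, and the sample messages are abbreviated. A gap is only a hint, since an idle FE also logs little; the GC log configured through `-Xloggc` is the place to confirm a Full GC.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2023-09-15 07:42:33,747


def find_gaps(lines, min_gap_seconds=10.0):
    """Return (previous_ts, current_ts, seconds) for each long silence in the log."""
    gaps = []
    prev = None
    for line in lines:
        try:
            ts = datetime.strptime(line[:23], TS_FORMAT)
        except ValueError:
            continue  # stack-trace or wrapped line without a timestamp
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta >= min_gap_seconds:
                gaps.append((prev, ts, delta))
        prev = ts
    return gaps


# Abbreviated lines taken from the fe.log excerpt above.
sample = [
    "2023-09-15 07:42:33,810 INFO deregister query id",
    "at java.lang.Thread.run(Thread.java:748)",
    "2023-09-15 07:42:55,305 INFO finished to match colocate group",
    "2023-09-15 07:43:31,727 ERROR failed to begin txn after retried 3 times!",
]
gaps = find_gaps(sample)
for start, end, seconds in gaps:
    print(f"{start.time()} -> {end.time()}: {seconds:.3f}s without log output")
```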

Do you mean changing the Xmx in this setting?

JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx16384m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:${LOG_DIR}/fe.gc.log.$DATE -XX:+PrintConcurrentLocks"

Yes, you can raise it somewhat. How much data does the cluster hold now? Roughly how many tablets are there? Can the FE start up now?
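The change under discussion amounts to replacing the single `-Xmx` token in the `JAVA_OPTS` string. A small sketch of that edit follows; `set_xmx` is a hypothetical helper, and `20480m` is only a placeholder value, since the right heap size depends on what else runs on the 32 GB hosts.

```python
import re


def set_xmx(java_opts: str, new_xmx: str) -> str:
    """Return java_opts with its -Xmx token replaced by -Xmx<new_xmx>."""
    updated, count = re.subn(r"-Xmx\S+", f"-Xmx{new_xmx}", java_opts)
    if count != 1:
        # Refuse to guess when the flag is missing or appears more than once.
        raise ValueError(f"expected exactly one -Xmx flag, found {count}")
    return updated


# Shortened version of the JAVA_OPTS value quoted in the thread.
opts = "-Dlog4j2.formatMsgNoLookups=true -Xmx16384m -XX:+UseMembar -XX:SurvivorRatio=8"
new_opts = set_xmx(opts, "20480m")  # placeholder value, not a recommendation
print(new_opts)
```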

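The cluster currently holds about 8-9 TB of data in total, with 4.38 million tablets. There are 3 FEs now, and all of them have been running since they were started this morning.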

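Increase Xmx first and then keep an eye on memory. You should also review your bucketing design: the bucket count looks too high, so each bucket holds very little data, which is not ideal. We recommend 100 MB to 1 GB of data per bucket.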
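To put numbers on that, here is a back-of-the-envelope sketch using the figures from this thread. It takes 8.5 TB as the midpoint of "8-9 TB" and ignores replication, empty partitions, and compression, so treat the result as an order-of-magnitude estimate. Both function names are illustrative.

```python
MB_PER_TB = 1024 * 1024


def average_tablet_mb(total_data_tb: float, tablet_count: int) -> float:
    """Average data per tablet in MB, assuming data is spread evenly."""
    return total_data_tb * MB_PER_TB / tablet_count


def tablet_count_range(total_data_tb: float, min_mb: float = 100, max_mb: float = 1024):
    """Tablet counts that would put each tablet between min_mb and max_mb."""
    total_mb = total_data_tb * MB_PER_TB
    return int(total_mb / max_mb), int(total_mb / min_mb)


avg = average_tablet_mb(8.5, 4_380_000)  # 8.5 TB is an assumed midpoint
low, high = tablet_count_range(8.5)
print(f"average per tablet: about {avg:.1f} MB")
print(f"tablet count for 100 MB to 1 GB per tablet: {low} to {high}")
```

Under these assumptions the average is roughly 2 MB per tablet, far below the recommended 100 MB to 1 GB, and the tablet count is dozens of times higher than that recommendation implies.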

One more question: after changing this, does the FE node need to be restarted for it to take effect?

Yes, the FE must be restarted for the change to take effect.