2.5.11 FE node went offline unexpectedly

Version 2.5.11, 5 FEs and 11 BEs.

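According to the logs, an FE node was kicked out of the cluster in the middle of the night. Could this kind of failure be caused by network jitter?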

Output in fe.out:

[2023-09-11 02:51:46] notify new FE type transfer: UNKNOWN
[2023-09-11 02:51:47] notify new FE type transfer: FOLLOWER
[2023-09-11 02:52:19] notify new FE type transfer: UNKNOWN
[2023-09-11 02:52:30] failed to get DB names for 1 times!Got EnvironmentFailureException and the current ReplicatedEnvironment is invalid, will exit.
[2023-09-11 02:52:30] this node is DETACHED
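
To see how quickly the node flipped between roles, the fe.out lines above can be put on a timeline. This is a minimal Python sketch that assumes only the `[timestamp] notify new FE type transfer: ROLE` line shape shown above; the helper name `role_transitions` is made up for illustration.

```python
import re
from datetime import datetime

# Matches fe.out lines such as:
# [2023-09-11 02:51:46] notify new FE type transfer: UNKNOWN
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] notify new FE type transfer: (\w+)"
)


def role_transitions(lines):
    """Return (timestamp, role, seconds_since_previous_transition) tuples."""
    result = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not role-transfer notices
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        gap = (ts - previous).total_seconds() if previous is not None else None
        result.append((ts, match.group(2), gap))
        previous = ts
    return result


if __name__ == "__main__":
    # The three transfer lines quoted from fe.out in this post.
    sample = [
        "[2023-09-11 02:51:46] notify new FE type transfer: UNKNOWN",
        "[2023-09-11 02:51:47] notify new FE type transfer: FOLLOWER",
        "[2023-09-11 02:52:19] notify new FE type transfer: UNKNOWN",
    ]
    for ts, role, gap in role_transitions(sample):
        print(ts, role, gap)
```

Short gaps between UNKNOWN and FOLLOWER only show that the node lost and regained its role; they do not by themselves say whether the cause was the network or the JVM.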

Attached are the fe.log and fe.warn.log output:
fe.log_20230911 (111.1 KB) fe.warn.log_20230911 (22.8 KB)

Can the FE that went down be started normally now?

The FE has already started normally. I just want to know why it went offline :joy:

Please send the fe.gc.log from this FE node covering the period around 2023-09-11 02:51.

fe.gc.log_20230911 (48.3 KB)

Is this from the FE that went down? This GC log only goes up to 2023-09-11T02:41:47.901+0800.

Yes, it is the GC log from the FE that went down. Do you also need the GC logs from the other FE nodes?
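
One thing a GC log can show is whether a long JVM pause lines up with the time the node dropped out. This is a minimal Python sketch, assuming JDK 8 style GC lines that begin with an ISO date stamp and carry a `real=N secs` field. The helper name `long_pauses` and the 1-second threshold are assumptions for illustration, and nothing here is taken from the attached logs.

```python
import re

# Assumed JDK 8 GC log shape (date stamps enabled), for example:
# 2023-09-11T02:41:47.901+0800: 123.456: [GC ... [Times: user=0.10 sys=0.00, real=0.03 secs]
STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}):")
REAL_RE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(lines, threshold_secs=1.0):
    """Return (date_stamp, real_seconds) for GC events at or above the threshold."""
    found = []
    for line in lines:
        stamp = STAMP_RE.match(line)
        real = REAL_RE.search(line)
        if stamp and real and float(real.group(1)) >= threshold_secs:
            found.append((stamp.group(1), float(real.group(1))))
    return found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python long_pauses.py fe.gc.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for stamp, secs in long_pauses(handle):
            print(stamp, secs)
```

If the log stops well before the incident, as it does here, the scan cannot confirm or rule out a pause at the time of the failure.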

Please also send the leader FE's GC log. Also, are all 5 nodes leader + followers, or are there observers? What role did the FE that went down have?

fe.gc.log_leader (752.4 KB)
It is a pure leader + follower setup; the FE that went down was a follower.

Is there any conclusion yet? Two more FE nodes went down last night in the early hours, one leader and one follower :sob: :sob: :sob:

Is the error in the logs still the same as this one?
2023-09-11 02:52:30,570 WARN (replayer|73) [BDBJournalCursor.wrapDatabaseException():84] failed to get DB names for 1 times!Got EnvironmentFailureException and the current ReplicatedEnvironment is invalid, will exit.
com.sleepycat.je.EnvironmentFailureException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) 10.133.58.206_9010_1690453016264(26):/data1/starrocks/starrocks/fe/meta/bdb A replica with the name: 10.133.58.206_9010_1690453016264(26) is already active with the Feeder:null HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed.

The leader's log looks different; it does not contain the handshake keyword.

2023-09-13 00:26:07,581 WARN (thrift-server-pool-33721911|33890273) [TIOStreamTransport.close():110] Error closing output stream.
java.net.SocketException: Socket closed
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118) ~[?:1.8.0_202]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_202]
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_202]
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_202]
        at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_202]
        at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:108) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.transport.TSocket.close(TSocket.java:235) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:326) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2023-09-13 00:26:07,581 INFO (tablet scheduler|30) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 1.000385345874084, low/mid/high: 5/1/5
2023-09-13 00:26:07,587 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,587 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,588 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,588 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,589 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,590 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,591 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,592 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch
2023-09-13 00:26:07,595 WARN (heartbeat mgr|29) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: Out-dated epoch

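Right now I mainly want to settle on a general direction for troubleshooting :joy:
If it is a network problem, I will take it to the cloud vendor to investigate.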

Attached is the fe.log from the FE leader node that went down:
fe_leader.log20230913 (1.5 MB)
The fe.gc log shows the same errors.

How large is the FE JVM heap, and at what time did each of the two FEs go down?

Please post the results of show backends; and show proc '/statistic';

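The FE JVM heap is 8G. The leader went down at 00:26 and the follower at 06:20. What matters most is why the leader node went offline.
The information you asked for has been sent by private message.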

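  1. Consider switching to JDK 11 with the G1 collector, and see whether the FE still goes down easily.
  2. The current JVM heap is fairly small. If JDK 11 with G1 does not help, increase the JVM heap.
    After the change, keep observing for a while.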

In fe.conf, change:
JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx16g -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"

JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx16g -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
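
After editing fe.conf, it can be worth confirming that the options say what was intended. This is a minimal Python sketch that pulls the maximum heap size and the collector flag out of a JAVA_OPTS string; `summarize_java_opts` is a hypothetical helper, not part of StarRocks.

```python
import re


def summarize_java_opts(opts):
    """Extract the -Xmx value and any -XX:+Use...GC collector flags."""
    heap = re.search(r"-Xmx(\d+)([kKmMgG])", opts)
    collectors = re.findall(r"-XX:\+(Use\w+GC)\b", opts)
    return {
        "max_heap": heap.group(1) + heap.group(2).lower() if heap else None,
        "collectors": collectors,
    }


if __name__ == "__main__":
    # The JAVA_OPTS_FOR_JDK_9 value suggested above.
    opts = (
        "-Dlog4j2.formatMsgNoLookups=true -Xmx16g -XX:+UseG1GC "
        "-Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    )
    print(summarize_java_opts(opts))
```

This only inspects the configured string. Whether the running FE picked up the new options still has to be checked on the process itself after a restart.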


Did this get resolved after switching to JDK 11? Our cluster's JVM heap is 32G and we just hit the same problem.

You can switch to the G1 GC algorithm.