Leader does not fail over automatically after the physical machine goes down

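【Details】Detailed description of the problem
【Background】What operations were performed?
【Business impact】
【StarRocks version】e.g. 2.5.1
【Cluster size】e.g. 5 FE + 5 BE (FE and BE co-located)
【Machine info】CPU virtual cores / memory / NIC, e.g. 96C/256G/10GbE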

  • The main log entries from je.info.0 are as follows:
    2023-02-01 20:59:10.709 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x43b: predicted min util is below minUtilization, current util min: 4 max: 4,
    2023-03-06 13:01:35.191 UTC INFO [10.240.162.165_9010_1647426628749] ReplicaOutputThread soft shutdown initiated.
    2023-03-06 13:01:35.191 UTC INFO [10.240.162.165_9010_1647426628749] Replica IO exception: java.io.IOException Message:Broken pipe
    2023-03-06 13:01:35.191 UTC INFO [10.240.162.165_9010_1647426628749] Exiting inner Replica loop.
    2023-03-06 13:01:35.191 UTC INFO [10.240.162.165_9010_1647426628749] Replica stats - Lag waits: 0 Lag wait time: 0ms. VLSN waits: 0 Lag wait time: 0ms.
    2023-03-06 13:01:35.191 UTC INFO [10.240.162.165_9010_1647426628749] node:NullNode(-1) state change from REPLICA to UNKNOWN
    2023-03-06 13:01:50.216 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Reason:java.net.SocketTimeoutException
    2023-03-06 13:02:14.369 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Reason:java.net.NoRouteToHostException: No route to host
    2023-03-06 13:32:35.583 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Reason:java.net.NoRouteToHostException: No route to host
    2023-03-06 13:32:52.606 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Reason:java.net.NoRouteToHostException: No route to host
    2023-03-06 13:33:09.628 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Reason:java.net.NoRouteToHostException: No route to host

2023-03-06 18:29:57.475 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable. Rea
2023-03-06 20:16:09.806 UTC INFO [10.240.162.165_9010_1647426628749] Queried master:10.240.162.164_9010_1647425764783(1) unavailable.
2023-03-07 02:15:18.473 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b7: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 50 max: 50, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:15:18.521 UTC INFO [10.240.162.165_9010_1647426628749] Started ServiceDispatcher. HostPort=sz42bla0813rock:9010
2023-03-07 02:15:18.522 UTC INFO [10.240.162.165_9010_1647426628749] DataChannel factory: com.sleepycat.je.rep.utilint.net.SimpleChannelFactory
2023-03-07 02:15:18.524 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 1 on file 0x4b7 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8368 nINsObsolete=268 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8098 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:15:18.525 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b8: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 60 max: 60, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:15:18.567 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 2 on file 0x4b8 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8373 nINsObsolete=258 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8114 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:15:18.594 UTC INFO [10.240.162.165_9010_1647426628749] DTVLSN at start:38,466,078
2023-03-07 02:15:18.595 UTC INFO [10.240.162.165_9010_1647426628749] node:NullNode(-1) state change from DETACHED to UNKNOWN
2023-03-07 02:15:18.612 UTC INFO [10.240.162.165_9010_1647426628749] Current group size: 1
2023-03-07 02:15:18.612 UTC INFO [10.240.162.165_9010_1647426628749] Existing node 10.240.162.165_9010_1647426628749 querying for a current master.
2023-03-07 02:15:21.629 UTC INFO [10.240.162.165_9010_1647426628749] Node 10.240.162.165_9010_1647426628749 started as SECONDARY
2023-03-07 02:15:38.585 UTC INFO [10.240.162.165_9010_1647426628749] Cleaner has 2 files not deleted because they are protected by replication: 0x4b7-0x4b8 Candidates for deletion: 0x4b7-0x4b8 Computing replayCostMinVLSN: networkRestoreBytes=57,538,235 maxReplayBytes=38,358,823 replayBytes=19,998,419 firstVLSN=38,346,865 replayCostMinVLSN=38,346,865 Replication prevents deletion of 2 files to support initial replication. Retained VLSN range: -1(file 0x0) - 38,466,078(file 0x4bc). Global CBVLSN=-1(file 0x0); determining node: last reported at 1970-01-01 00:00:00.000 UTC. Files chosen for deletion by HA:
2023-03-07 02:36:49.445 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b7: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 50 max: 50, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:36:49.466 UTC INFO [10.240.162.165_9010_1647426628749] Started ServiceDispatcher. HostPort=sz42bla0813rock:9010
2023-03-07 02:36:49.467 UTC INFO [10.240.162.165_9010_1647426628749] DataChannel factory: com.sleepycat.je.rep.utilint.net.SimpleChannelFactory
2023-03-07 02:36:49.501 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 1 on file 0x4b7 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8368 nINsObsolete=268 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8098 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:36:49.502 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b8: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 60 max: 60, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:36:49.523 UTC INFO [10.240.162.165_9010_1647426628749] DTVLSN at start:38,466,078
2023-03-07 02:36:49.524 UTC INFO [10.240.162.165_9010_1647426628749] node:NullNode(-1) state change from DETACHED to UNKNOWN
2023-03-07 02:36:49.542 UTC INFO [10.240.162.165_9010_1647426628749] Current group size: 1
2023-03-07 02:36:49.542 UTC INFO [10.240.162.165_9010_1647426628749] Existing node 10.240.162.165_9010_1647426628749 querying for a current master.
2023-03-07 02:36:49.555 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 2 on file 0x4b8 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8373 nINsObsolete=258 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8114 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:36:51.620 UTC INFO [10.240.162.165_9010_1647426628749] Node 10.240.162.165_9010_1647426628749 started as SECONDARY
2023-03-07 02:37:09.625 UTC INFO [10.240.162.165_9010_1647426628749] Cleaner has 2 files not deleted because they are protected by replication: 0x4b7-0x4b8 Candidates for deletion: 0x4b7-0x4b8 Computing replayCostMinVLSN: networkRestoreBytes=57,539,864 maxReplayBytes=38,359,909 replayBytes=19,998,419 firstVLSN=38,346,865 replayCostMinVLSN=38,346,865 Replication prevents deletion of 2 files to support initial replication. Retained VLSN range: -1(file 0x0) - 38,466,078(file 0x4bc). Global CBVLSN=-1(file 0x0); determining node: last reported at 1970-01-01 00:00:00.000 UTC. Files chosen for deletion by HA:
2023-03-07 02:49:02.677 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b7: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 50 max: 50, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:49:02.701 UTC INFO [10.240.162.165_9010_1647426628749] Started ServiceDispatcher. HostPort=sz42bla0813rock:9010
2023-03-07 02:49:02.701 UTC INFO [10.240.162.165_9010_1647426628749] DataChannel factory: com.sleepycat.je.rep.utilint.net.SimpleChannelFactory
2023-03-07 02:49:02.726 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 1 on file 0x4b7 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8368 nINsObsolete=268 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8098 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:49:02.727 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b8: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 60 max: 60, chose file with util min: 0 max: 0 avg: 0
2023-03-07 02:49:02.762 UTC INFO [10.240.162.165_9010_1647426628749] DTVLSN at start:38,466,078
2023-03-07 02:49:02.763 UTC INFO [10.240.162.165_9010_1647426628749] node:NullNode(-1) state change from DETACHED to UNKNOWN
2023-03-07 02:49:02.774 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 2 on file 0x4b8 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8373 nINsObsolete=258 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8114 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 02:49:02.784 UTC INFO [10.240.162.165_9010_1647426628749] Current group size: 1
2023-03-07 02:49:02.784 UTC INFO [10.240.162.165_9010_1647426628749] Existing node 10.240.162.165_9010_1647426628749 querying for a current master.
2023-03-07 02:49:05.796 UTC INFO [10.240.162.165_9010_1647426628749] Node 10.240.162.165_9010_1647426628749 started as SECONDARY
2023-03-07 02:49:22.800 UTC INFO [10.240.162.165_9010_1647426628749] Cleaner has 2 files not deleted because they are protected by replication: 0x4b7-0x4b8 Candidates for deletion: 0x4b7-0x4b8 Computing replayCostMinVLSN: networkRestoreBytes=57,541,412 maxReplayBytes=38,360,941 replayBytes=19,998,419 firstVLSN=38,346,865 replayCostMinVLSN=38,346,865 Replication prevents deletion of 2 files to support initial replication. Retained VLSN range: -1(file 0x0) - 38,466,078(file 0x4bc). Global CBVLSN=-1(file 0x0); determining node: last reported at 1970-01-01 00:00:00.000 UTC. Files chosen for deletion by HA:
2023-03-07 03:05:43.760 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b7: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 50 max: 50, chose file with util min: 0 max: 0 avg: 0
2023-03-07 03:05:43.784 UTC INFO [10.240.162.165_9010_1647426628749] Started ServiceDispatcher. HostPort=sz42bla0813rock:9010
2023-03-07 03:05:43.784 UTC INFO [10.240.162.165_9010_1647426628749] DataChannel factory: com.sleepycat.je.rep.utilint.net.SimpleChannelFactory
2023-03-07 03:05:43.822 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 1 on file 0x4b7 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8368 nINsObsolete=268 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8098 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 03:05:43.822 UTC INFO [10.240.162.165_9010_1647426628749] Clean file 0x4b8: file has avg util below minFileUtilization, current util min: 50 max: 50, predicted util min: 60 max: 60, chose file with util min: 0 max: 0 avg: 0
2023-03-07 03:05:43.845 UTC INFO [10.240.162.165_9010_1647426628749] DTVLSN at start:38,466,078
2023-03-07 03:05:43.846 UTC INFO [10.240.162.165_9010_1647426628749] node:NullNode(-1) state change from DETACHED to UNKNOWN
2023-03-07 03:05:43.865 UTC INFO [10.240.162.165_9010_1647426628749] Current group size: 1
2023-03-07 03:05:43.866 UTC INFO [10.240.162.165_9010_1647426628749] Existing node 10.240.162.165_9010_1647426628749 querying for a current master.
2023-03-07 03:05:43.966 UTC INFO [10.240.162.165_9010_1647426628749] CleanerRun 2 on file 0x4b8 ends: invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=8373 nINsObsolete=258 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=8114 nLNsCleaned=0 nLNsDead=0 nLNsExpired=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 inSummary= estSummary= recalcSummary= estimatedUtil=0 recalcUtil=0
2023-03-07 03:05:46.878 UTC INFO [10.240.162.165_9010_1647426628749] Node 10.240.162.165_9010_1647426628749 started as SECONDARY
2023-03-07 03:06:04.006 UTC INFO [10.240.162.165_9010_1647426628749] Cleaner has 2 files not deleted because they are protected by replication: 0x4b7-0x4b8 Candidates for deletion: 0x4b7-0x4b8 Computing replayCostMinVLSN: networkRestoreBytes=57,543,088 maxReplayBytes=38,362,058 replayBytes=19,998,419 firstVLSN=38,346,865 replayCostMinVLSN=38,346,865 Replication prevents deletion of 2 files to support initial replication. Retained VLSN range: -1(file 0x0) - 38,466,078(file 0x4bc). Global CBVLSN=-1(file 0x0); determining node: last reported at 1970-01-01 00:00:00.000 UTC. Files chosen for deletion by HA:
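
Two lines in the log above are worth pulling out: `Current group size: 1` and `started as SECONDARY`. As far as I understand BDB JE, a SECONDARY node neither votes nor can become master, and the group size counts only electable nodes, so together they suggest this FE is not an election candidate. Below is a minimal Python sketch for extracting both signals from a je.info file. The function name `summarize_je_log` and the returned dictionary layout are my own for illustration, not part of StarRocks or JE.

```python
import re

# Patterns copied from the log lines quoted in this thread.
GROUP_SIZE = re.compile(r"Current group size: (\d+)")
SECONDARY = re.compile(r"Node (\S+) started as SECONDARY")


def summarize_je_log(lines):
    """Collect the group sizes and SECONDARY node names seen in je.info lines."""
    group_sizes = set()
    secondary_nodes = set()
    for line in lines:
        match = GROUP_SIZE.search(line)
        if match:
            group_sizes.add(int(match.group(1)))
        match = SECONDARY.search(line)
        if match:
            secondary_nodes.add(match.group(1))
    return {
        "group_sizes": sorted(group_sizes),
        "secondary_nodes": sorted(secondary_nodes),
    }


# Two lines taken verbatim from the log above.
sample = [
    "2023-03-07 02:15:18.612 UTC INFO [10.240.162.165_9010_1647426628749] Current group size: 1",
    "2023-03-07 02:15:21.629 UTC INFO [10.240.162.165_9010_1647426628749] Node 10.240.162.165_9010_1647426628749 started as SECONDARY",
]
summary = summarize_je_log(sample)
print(summary)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `summarize_je_log(open("je.info.0"))`.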

How many FE followers do you have, and which version are you running? Please also grab the fe.log from the FE that went down.

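It may be a configuration problem on my side: I set every node to observer and had no follower nodes, so when the leader went down it could not fail over. I will reinstall the cluster, configure followers, and try again. Thanks a lot!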

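OK, sounds good. You can also scale in (remove the nodes) and then re-add them with the follower role. HA mode needs 3 nodes in the follower role to form, and we recommend deploying in HA mode.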
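
To make the reasoning above concrete, here is a rough Python sketch under two assumptions: that leader election needs a simple majority of the electable nodes (leader plus followers), and that observers never vote. The role lists and the host name `fe-host-2` are hypothetical, and 9010 is the edit log port seen in the logs above. The helper also prints the `ALTER SYSTEM DROP OBSERVER` / `ALTER SYSTEM ADD FOLLOWER` statements for switching one node's role; any node-side steps needed before the FE rejoins are not covered here.

```python
def can_elect_new_leader(roles):
    """Return True if the electable nodes left after losing the leader
    still form a simple majority of the original electable group."""
    electable = sum(1 for role in roles if role in ("LEADER", "FOLLOWER"))
    survivors = electable - 1          # the leader's machine is gone
    majority = electable // 2 + 1      # votes needed to elect a new leader
    return survivors >= majority


def role_switch_statements(host, edit_log_port=9010):
    """Build the SQL statements that drop an observer and re-add it as a follower."""
    target = f"{host}:{edit_log_port}"
    return [
        f'ALTER SYSTEM DROP OBSERVER "{target}";',
        f'ALTER SYSTEM ADD FOLLOWER "{target}";',
    ]


# Hypothetical layouts: one leader with only observers, and a 3-follower HA group.
all_observers = ["LEADER", "OBSERVER", "OBSERVER", "OBSERVER", "OBSERVER"]
three_followers = ["LEADER", "FOLLOWER", "FOLLOWER"]

print(can_elect_new_leader(all_observers))
print(can_elect_new_leader(three_followers))
for statement in role_switch_statements("fe-host-2"):
    print(statement)
```

Under these assumptions, the all-observer layout has no surviving voter once the leader is lost, while the three-follower layout keeps two of three votes.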