FE fails to start after a simulated power-loss reboot

rootwang · 2026-01-16 06:51

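To help locate your problem faster, please provide the following information. Thank you.
[Details] With StarRocks running normally, I rebooted every virtual machine. After they came back up, the FEs failed to start.
[Background] I set up a cluster of 3 FEs and 3 BEs on 3 virtual machines, and everything ran normally. In both fe.conf and be.conf, priority_networks is set to {IP of the VM it is deployed on}/24. The processes are registered as system services and enabled at boot. The cluster is brand new and holds no data. With the services running normally, I rebooted the VMs directly. After the reboot the FE cluster would not come up, and the mysql client could not connect to port 9030.
[Business impact]
[Shared-data mode] No
[StarRocks version] 3.4
[Cluster size] 3 FE (1 leader + 2 followers) + 3 BE (FE and BE co-located)
[Machine info] 16 GB memory; disks are not full
[Contact] rootwang@163.com
[Attachments]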

The fe.log on all three nodes repeatedly reports the following:

[Image: fe.log screenshot]

I went through the hints in the message. None of them apply in my case, and the clocks are synchronized:
It took too much time for FE to transfer to a stable state(LEADER/FOLLOWER), it maybe caused by one of the following reasons:
1. There are too many BDB logs to replay, because of previous failure of checkpoint(you can check the create time of image file under meta/image dir).
2. Majority voting members(LEADER or FOLLOWER) of the FE cluster haven't started completely.
3. FE node has multiple IPs, you should configure the priority_networks in fe.conf to match the ip record in meta/image/ROLE. And we don't support change the ip of FE node. Ignore this reason if you are using FQDN.
4. The time deviation between FE nodes is greater than 5s, please use ntp or other tools to keep clock synchronized.
5. The configuration of edit_log_port has changed, please reset to the original value.
6. The replayer thread may get stuck, please use jstack to find the details.
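To rule out reasons 2 and 5 quickly, a small reachability probe can help. The following is a minimal Python sketch, not something shipped with StarRocks: the function names port_open and has_majority are made up for illustration, and it assumes the FEs use the default edit_log_port 9010 and query port 9030. An open port only shows that a process is listening, not that the FE has reached the LEADER/FOLLOWER state.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def has_majority(hosts, port: int, timeout: float = 2.0) -> bool:
    """Return True if more than half of the given hosts accept connections on port."""
    reachable = sum(port_open(host, port, timeout) for host in hosts)
    return reachable > len(hosts) // 2
```

For example, has_majority(fe_hosts, 9010), with fe_hosts set to the three FE addresses, returns True only when at least two of the three accept connections on the edit log port.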

Could anyone help take a look? Thanks. The relevant logs are attached: admin-3.zip (63.8 KB)

rootwang · 2026-01-16 07:36

All the BEs started successfully.

rootwang · 2026-01-16 07:53

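When the cluster was first brought up, the first FE started was the one on node 11. The other two were started as described in the official docs: start_fe.sh --helper 192.168.10.11:9010 --daemon. When the FEs started after the VM reboot, --helper was not passed.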

rootwang · 2026-01-16 10:44

Solved. It was a priority_networks problem after all: on each machine, changing ${local VM IP} to 192.168.10.0/24 fixed it. Startup still takes a while, though.
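For anyone double-checking their own value, here is a minimal Python sketch that tests whether an FE's IP falls inside the configured priority_networks. It uses only the standard ipaddress module. The function name in_priority_networks is made up for illustration, and the sketch assumes the value is a semicolon-separated list of CIDR blocks. It applies plain CIDR arithmetic and does not reproduce how StarRocks itself parses or matches the setting, so it cannot explain why the change above was needed.

```python
import ipaddress


def in_priority_networks(ip: str, priority_networks: str) -> bool:
    """Return True if ip falls inside any CIDR block of a semicolon-separated list."""
    addr = ipaddress.ip_address(ip)
    for entry in priority_networks.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # strict=False tolerates host bits being set, e.g. 192.168.10.11/24.
        network = ipaddress.ip_network(entry, strict=False)
        if addr in network:
            return True
    return False
```

For example, in_priority_networks("192.168.10.11", "192.168.10.0/24") is True, so the network-address form covers the FE on node 11.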