Metadata is not synchronized across the 3 FE nodes; only the leader node is working

【Details】Detailed problem description
We have three FE nodes and have always connected to the leader. After a while we noticed that the database information on the other two FE nodes differs from that on the leader FE; the three FEs are not synchronized.
【Background】After creating a Hive external table, refreshing its partitions fails with an error saying the other two nodes have no information about this table.
【Business impact】
【StarRocks version】2.0.0-GA
【Cluster size】3 FE (3 followers) + 3 BE (FE and BE co-located)
【Machine info】CPU vCores / memory / NIC, e.g. 48C/64G/10GbE
【Attachments】
Log screenshot from a follower node

  • fe.warn.log/be.warn.log entries
    2022-06-10 15:09:48,648 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:48,649 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:48,649 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:53,652 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:53,653 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:53,653 WARN (heartbeat mgr|40) [HeartbeatMgr.runAfterCatalogReady():143] get bad heartbeat response: type: BACKEND, status: BAD, msg: epoch is not greater than local. ignore heartbeat.
    2022-06-10 15:09:54,671 WARN (Routine load task scheduler|53) [RoutineLoadTaskScheduler.process():103] no available be slot to scheduler tasks, wait for 10 seconds to scheduler again, you can set max_routine_load_task_num_per_be bigger in fe.conf, current value is 5

An error is reported when refreshing the Hive external table's partition information
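
Roughly the statement that triggers the error (a sketch only; the host, database, and table names below are placeholders, not the actual ones used here):

    # Issued against the leader FE's query port (9030 by default).
    # REFRESH EXTERNAL TABLE also accepts a PARTITION (...) clause to narrow the refresh.
    mysql -h <leader_fe_ip> -P 9030 -uroot -e "REFRESH EXTERNAL TABLE hive_db.hive_tbl;"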

Please post a screenshot of show frontends, and also take a look at fe.log.
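
For reference, that can also be pulled from the command line; a minimal sketch, assuming the default query port 9030 and a passwordless root account (10.1.60.12 is one of the FEs seen in the log above):

    # Works against any FE. On a healthy cluster every row shows Alive=true,
    # exactly one row has IsMaster=true, and the followers' ReplayedJournalId
    # values stay close to the leader's.
    mysql -h 10.1.60.12 -P 9030 -uroot -e "SHOW FRONTENDS\G"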


fe.log from the normal node

fe.log from the abnormal node

It looks like when you added those nodes, you did not specify --helper on their first start.
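
For context, a newly added FE pulls its metadata from an existing member on its very first start, which is what --helper is for. A minimal sketch, assuming the default edit_log_port 9010 (the host names are placeholders):

    # Register the new follower on the current leader first (query port 9030):
    #   ALTER SYSTEM ADD FOLLOWER "<new_follower_ip>:9010";
    #
    # Then, on the new follower, the FIRST start must point at an FE that is
    # already in the cluster; later restarts no longer need --helper.
    ./bin/start_fe.sh --helper <existing_fe_ip>:9010 --daemon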


That does seem to be what happened. How do we fix it?

Should we delete this FE and reinstall it?

Which one is the normal FE node?

The normal node is the one we are actually using now; we have hardly ever connected to the abnormal nodes, and the table information was never synced to them.

Then drop the other two FEs and clear their metadata directories. After that, re-join them to the cluster with --helper.
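
A rough outline of that procedure, assuming the default ports (query port 9030, edit_log_port 9010) and the default meta directory; every host name below is a placeholder:

    # 1) On the leader, drop each out-of-sync follower:
    mysql -h <leader_fe_ip> -P 9030 -uroot -e 'ALTER SYSTEM DROP FOLLOWER "<follower_ip>:9010";'

    # 2) On each dropped follower, stop the FE and wipe its metadata directory
    #    (whatever meta_dir in fe.conf points to; fe/meta by default):
    ./bin/stop_fe.sh
    rm -rf meta/*

    # 3) On the leader, add the follower back:
    mysql -h <leader_fe_ip> -P 9030 -uroot -e 'ALTER SYSTEM ADD FOLLOWER "<follower_ip>:9010";'

    # 4) On the follower, the first start must point at the leader with --helper,
    #    after which it replays the full metadata from the leader:
    ./bin/start_fe.sh --helper <leader_fe_ip>:9010 --daemon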


However, if you do it this way, the databases/tables you created on those two FEs will be lost. There is no way around that, though.

That's fine, the tables created on those two nodes aren't needed. Doing this won't affect the working node, right? And it won't affect the load jobs currently running on the leader node?

No, it won't affect the existing node.

After stopping the two FEs, starting them again failed; the log shows this error:

2022-06-13 14:19:17,046 ERROR (MASTER 10.1.60.12_9010_1655101155153(-1)|1) [BDBJEJournal.open():270] catch an exception when setup bdb environment. will exit.
com.sleepycat.je.DatabaseNotFoundException: (JE 7.3.7) _jeRepGroupDB
at com.sleepycat.je.rep.impl.RepImpl.openGroupDb(RepImpl.java:1933) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.getGroupDb(RepImpl.java:1871) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepGroupDB.reinitFirstNode(RepGroupDB.java:1433) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.node.RepNode.reinitSelfElect(RepNode.java:1734) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.node.RepNode.startup(RepNode.java:895) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.node.RepNode.joinGroup(RepNode.java:2157) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.impl.RepImpl.joinGroup(RepImpl.java:610) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.ReplicatedEnvironment.joinGroup(ReplicatedEnvironment.java:560) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:621) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:466) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:540) ~[je-7.3.7.jar:7.3.7]
at com.sleepycat.je.rep.util.DbResetRepGroup.reset(DbResetRepGroup.java:262) ~[je-7.3.7.jar:7.3.7]
at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:105) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.open(BDBJEJournal.java:267) [starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.open(EditLog.java:832) [starrocks-fe.jar:?]
at com.starrocks.catalog.Catalog.initialize(Catalog.java:849) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:110) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:65) [starrocks-fe.jar:?]

Did you configure metadata_failure_recovery=true by any chance?


Yes. I adjusted that setting on the other two FEs and fixed it by following the metadata recovery procedure in the official documentation.
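
A note for anyone who lands here with the same DatabaseNotFoundException: the DbResetRepGroup.reset frame in the stack trace above is the code path taken when metadata_failure_recovery=true is set, and it fails here because a freshly emptied metadata directory has no replication group left to reset. So before restarting a cleaned FE with --helper, make sure the flag is no longer active in fe.conf; a quick check (paths assume the standard fe/conf layout):

    # Is the recovery flag still present?
    grep -n 'metadata_failure_recovery' conf/fe.conf

    # Comment it out (or delete the line). Per the metadata recovery procedure in
    # the docs, the flag is only meant to be set temporarily on the FE holding the
    # most complete metadata, and removed again right after recovery.
    sed -i 's/^metadata_failure_recovery=true/# metadata_failure_recovery=true/' conf/fe.conf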