FE checkpoint失败

FE checkpoint失败

现象

版本:2.1

  1. FE master 报错日志
    The information_schema db is not persisted, it is generated in loadCluster when loading image. Its id will increase once the loadCluster is called. If the cluster runs for a long time(3 months for example), its id may be bigger than 10000 ( which is the upper limit of system id), because the loadCluster will be called every time a checkpoint is done. The reason why the information_schema db’s id will be increased maybe there will be many clusters in the system and every cluster will have its own information_schema db. But the cluster has be deprecated in our system, so we just give a fixed id number for every system db or table.

  2. bdb文件巨大

  3. FE master 内存监控

风险

FE 配置 8c32g,bdb文件过大,重启fe存在启动不了的风险

处理方式

  1. 先给一个follower搞个内存大的机器
  2. 看看这个follower的启动时间多长,以及占用了多少内存
  3. 然后占用的内存是机器内存的一半,可以完成正常的checkpoint
  4. 给这个节点指定成 新的 master
    java -jar fe/lib/je-7.3.7.jar DbGroupAdmin -helperHosts {ip:edit_log_port} -groupName PALO_JOURNAL_GROUP -transferMaster -force {node_name} 5000
  5. 等待checkpoint完成

注意

bdb不能手动删除,需要等待系统自动理清。
但条件是所有fe节点都online状态,如果存在运行不正常的fe,则bdb不会自动删除数据。此时需要主动drop非正常的fe。
待fe都恢复正常后,bdb会减少空间,最后重新加入剔除的fe节点即可。

3赞