Deploying an FE cluster on open-source 1.19.0

[StarRocks version] 1.19.0
[Cluster size] Expanding a single FE to a 3-FE cluster
[Machine info] Test environment, 4C8G

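[Details] Errors occur when deploying an FE cluster with the open-source 1.19.0 release.
[Background] The single-FE + 3-BE setup runs normally; I am testing FE scale-out to a 3-FE cluster.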

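Steps:

1. Deploy a new FE with the same configuration file as the running single FE.
2. Start the new FE with --helper, pointing at the running FE as the helper:
sh start_fe.sh --helper 192.168.40.81:9010 --daemon
3. Check the new FE's logs.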

fe.warning.log contains:

2021-11-16 10:38:26,686 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:31,694 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:31,695 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:36,708 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:36,709 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:41,720 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:41,721 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:46,731 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:46,732 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:51,740 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:51,741 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]

fe.log contains:

2021-11-16 10:38:11,184 INFO (main|1) [DorisDbFe.start():102] DorisDb FE starting…
2021-11-16 10:38:11,191 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():121] configured prior_cidrs value: 192.168.40.0/24
2021-11-16 10:38:11,203 INFO (main|1) [FrontendOptions.init():89] local address: /192.168.40.98.
2021-11-16 10:38:11,308 INFO (main|1) [ConsistencyChecker.initWorkTime():106] consistency checker will work from 23:00 to 4:00
2021-11-16 10:38:11,611 INFO (main|1) [Catalog.getHelperNodes():1151] get helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:11,651 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:11,652 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:16,661 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:16,662 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:21,671 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400

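At this point the newly started FE process is still running. After the add operation below, the FE stops on its own.

4. Connect to the original running FE with a MySQL client and run the following: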

MySQL [(none)]> SHOW PROC '/frontends'\G
*************************** 1. row ***************************
Name: 192.168.40.81_9010_1630033959965
IP: 192.168.40.81
HostName: db-testware-01.novalocal
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
IsMaster: true
ClusterId: 720115122
Join: true
Alive: true
ReplayedJournalId: 2118732
LastHeartbeat: 2021-11-16 10:36:32
IsHelper: true
ErrMsg:
1 row in set (0.043 sec)

MySQL [(none)]> alter system add follower "192.168.40.98:9010";
Query OK, 0 rows affected (0.013 sec)

MySQL [(none)]> SHOW PROC '/frontends'\G
*************************** 1. row ***************************
Name: 192.168.40.81_9010_1630033959965
IP: 192.168.40.81
HostName: db-testware-01.novalocal
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
IsMaster: true
ClusterId: 720115122
Join: true
Alive: true
ReplayedJournalId: 2118778
LastHeartbeat: 2021-11-16 10:39:03
IsHelper: true
ErrMsg:
*************************** 2. row ***************************
Name: 192.168.40.98_9010_1637030333352
IP: 192.168.40.98
HostName: 192.168.40.98
EditLogPort: 9010
HttpPort: 8030
QueryPort: 0
RpcPort: 0
Role: FOLLOWER
IsMaster: false
ClusterId: 720115122
Join: false
Alive: false
ReplayedJournalId: 0
LastHeartbeat: NULL
IsHelper: true
ErrMsg: got exception
2 rows in set (0.053 sec)

The newly added FE shows Join and Alive both false, and its ErrMsg is "got exception".
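Spotting a node like this by eye gets harder as the cluster grows. Below is a minimal sketch that filters the vertical (\G) output of SHOW PROC '/frontends' and prints only the frontends whose Join or Alive is not true. The function name unhealthy_frontends is hypothetical, and the sketch assumes the field names shown in the output above.

```shell
# Read SHOW PROC '/frontends'\G output on stdin and print one line per
# frontend whose Join or Alive field is not "true".
# (unhealthy_frontends is an illustrative name, not a StarRocks tool.)
unhealthy_frontends() {
    awk '
        function emit() {
            if (ip != "" && (joined != "true" || alive != "true"))
                print ip, "join=" joined, "alive=" alive, "err=" err
            ip = ""; joined = ""; alive = ""; err = ""
        }
        # The mysql client right-aligns field names, so drop leading blanks.
        { sub(/^[ \t]+/, "") }
        # A row separator closes the previous record.
        /^\*+ [0-9]+\. row \*+/ { emit(); next }
        /^IP:/     { ip = $2 }
        /^Join:/   { joined = $2 }
        /^Alive:/  { alive = $2 }
        /^ErrMsg:/ { sub(/^ErrMsg:[ ]*/, ""); err = $0 }
        END { emit() }
    '
}
```

For example, piping the saved output through `unhealthy_frontends` would list 192.168.40.98 here and stay silent for 192.168.40.81.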

5. Now check the new FE's logs.

New content in fe.log:

2021-11-16 10:38:51,740 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1048] failed to get fe node type from helper node: 192.168.40.81:9010. response code: 400
2021-11-16 10:38:51,741 WARN (main|1) [Catalog.getClusterIdAndRole():925] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [192.168.40.81:9010]
2021-11-16 10:38:56,757 INFO (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1078] get fe node type FOLLOWER, name 192.168.40.98_9010_1637030333352 from 192.168.40.81:8030
2021-11-16 10:38:57,006 INFO (main|1) [Catalog.getClusterIdAndRole():1022] finished to get cluster id: 720115122, role: FOLLOWER and node name: 192.168.40.98_9010_1637030333352
2021-11-16 10:38:57,047 INFO (main|1) [Catalog.loadImage():1482] start load image from /mnt/dorisdb/fe/doris-meta/image/image.2099283. is ckpt: false
2021-11-16 10:38:57,047 INFO (main|1) [Catalog.loadHeader():1623] finished replay header from image
2021-11-16 10:38:57,051 INFO (main|1) [Catalog.loadMasterInfo():1634] finished replay masterInfo from image
2021-11-16 10:38:57,261 INFO (main|1) [Catalog.loadDb():1677] finished replay databases from image
2021-11-16 10:38:57,275 INFO (main|1) [Catalog.loadLoadJob():1708] finished replay loadJob from image
2021-11-16 10:38:57,297 INFO (main|1) [Catalog.loadAlterJob():1740] finished replay alterJob from image
2021-11-16 10:38:57,297 INFO (main|1) [Catalog.loadRecycleBin():1882] finished replay recycleBin from image
2021-11-16 10:38:57,314 INFO (main|1) [Catalog.loadGlobalVariable():2192] finished replay globalVariable from image
2021-11-16 10:38:57,318 INFO (main|1) [Catalog.loadCluster():6349] finished replay cluster from image
2021-11-16 10:38:57,328 INFO (main|1) [Catalog.loadBrokers():6446] finished replay brokerMgr from image
2021-11-16 10:38:57,330 INFO (main|1) [Catalog.loadResources():1914] finished replay resources from image
2021-11-16 10:38:57,349 INFO (main|1) [Catalog.loadExportJob():1725] finished replay exportJob from image
2021-11-16 10:38:57,354 INFO (main|1) [Catalog.loadBackupHandler():1827] finished replay backupHandler from image

New content in fe.out:

java.io.IOException: failed read PrivTable
at org.apache.doris.mysql.privilege.PrivTable.read(PrivTable.java:223)
at org.apache.doris.mysql.privilege.Auth.readFields(Auth.java:1407)
at org.apache.doris.catalog.Catalog.loadAuth(Catalog.java:1852)
at org.apache.doris.catalog.Catalog.loadImage(Catalog.java:1508)
at org.apache.doris.catalog.Catalog.initialize(Catalog.java:800)
at org.apache.doris.DorisDbFe.start(DorisDbFe.java:108)
at org.apache.doris.DorisDbFe.main(DorisDbFe.java:63)
Caused by: java.lang.ClassNotFoundException: com.starrocks.mysql.privilege.UserPrivTable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.doris.mysql.privilege.PrivTable.read(PrivTable.java:213)
… 6 more
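One detail in the trace above is worth a second look: the frames that run are under org.apache.doris, while the class that cannot be found is under com.starrocks. That may mean the binaries being started and the metadata image being loaded come from different releases. The sketch below flags this pattern in fe.out text; the function name namespace_mismatch is hypothetical, and it only knows the two package prefixes that appear in this trace.

```shell
# Read fe.out text on stdin. Compare the package prefix of the first
# application frame ("at ...") with the class named by
# ClassNotFoundException, and print a line if they disagree.
# (namespace_mismatch is an illustrative name.)
namespace_mismatch() {
    awk '
        /ClassNotFoundException:/ { missing = $NF }
        /^[ \t]*at / {
            if (running == "") {
                if (index($2, "org.apache.doris.") == 1) running = "org.apache.doris"
                else if (index($2, "com.starrocks.") == 1) running = "com.starrocks"
            }
        }
        END {
            if (missing != "" && running != "" && index(missing, running ".") != 1)
                print "mismatch: code runs as " running ", image references " missing
        }
    '
}
```

Run as `namespace_mismatch < fe.out`. No output means no such mismatch was found, which does not by itself prove the versions match.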

The newly added FE now shuts down; ps no longer shows an FE process.

[Attachments]

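Please drop this FE node first:
alter system drop follower "fe_host:edit_log_port";
Then follow the steps in order: first connect to the existing FE with a MySQL client and add the new instance, then start the new FE with --helper pointing at a node in the existing cluster, and see whether the error still occurs.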
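The order suggested here (register the follower first, start it second) can be written down as a short sketch. It only prints the two commands rather than running them. The ports 9030 and 9010 are the QueryPort and EditLogPort from the SHOW PROC output earlier in the thread; the root user with no password, running from the FE bin directory, and the function name print_join_plan are assumptions.

```shell
# Print, without executing, the two commands for joining a new follower:
#   1. register the new FE on the existing FE through the MySQL protocol
#   2. start the new FE with --helper pointing at the existing FE
# $1 = IP of the existing (helper) FE, $2 = IP of the new FE.
# Assumptions: root user without password, query port 9030, edit log port 9010.
print_join_plan() {
    helper=$1
    new_fe=$2
    printf '%s\n' "mysql -h $helper -P 9030 -uroot -e 'ALTER SYSTEM ADD FOLLOWER \"$new_fe:9010\";'"
    printf '%s\n' "sh start_fe.sh --helper $helper:9010 --daemon"
}
```

Calling `print_join_plan 192.168.40.81 192.168.40.98` prints the first command for the existing FE and the second for the new host.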

I did the following.
On the command line:

MySQL [(none)]> alter system drop follower "192.168.40.98:9010";
Query OK, 0 rows affected (0.015 sec)

MySQL [(none)]> SHOW PROC '/frontends'\G
*************************** 1. row ***************************
Name: 192.168.40.81_9010_1630033959965
IP: 192.168.40.81
HostName: db-testware-01.novalocal
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
IsMaster: true
ClusterId: 720115122
Join: true
Alive: true
ReplayedJournalId: 2118979
LastHeartbeat: 2021-11-16 10:50:09
IsHelper: true
ErrMsg:
1 row in set (0.208 sec)

MySQL [(none)]> alter system add follower "192.168.40.98:9010";
Query OK, 0 rows affected (0.007 sec)

MySQL [(none)]> SHOW PROC '/frontends'\G
*************************** 1. row ***************************
Name: 192.168.40.98_9010_1637033531693
IP: 192.168.40.98
HostName: 192.168.40.98
EditLogPort: 9010
HttpPort: 8030
QueryPort: 0
RpcPort: 0
Role: FOLLOWER
IsMaster: false
ClusterId: 720115122
Join: false
Alive: false
ReplayedJournalId: 0
LastHeartbeat: NULL
IsHelper: true
ErrMsg: got exception
*************************** 2. row ***************************
Name: 192.168.40.81_9010_1630033959965
IP: 192.168.40.81
HostName: db-testware-01.novalocal
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
IsMaster: true
ClusterId: 720115122
Join: true
Alive: true
ReplayedJournalId: 2119812
LastHeartbeat: 2021-11-16 11:32:13
IsHelper: true
ErrMsg:
2 rows in set (0.069 sec)

On the new FE, I cleared the previous doris-meta, log, and plugins directories, then ran:

sh start_fe.sh --helper 192.168.40.81:9010 --daemon
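The directory cleanup before this restart could be sketched as follows. Instead of deleting, it moves each directory aside with a timestamp suffix and recreates it empty, so the old metadata and logs stay available for inspection. The function name reset_fe_dirs is hypothetical, and the sketch assumes the three directories sit directly under the FE home (the log above shows /mnt/dorisdb/fe/doris-meta). It should only be run while the FE process is stopped.

```shell
# Move doris-meta, log and plugins aside and recreate them empty.
# $1 = FE home directory. Run only while the FE is stopped.
# (reset_fe_dirs is an illustrative name; the directory layout is assumed.)
reset_fe_dirs() {
    fe_home=$1
    stamp=$(date +%Y%m%d%H%M%S)
    for d in doris-meta log plugins; do
        if [ -d "$fe_home/$d" ]; then
            # Keep the old content under a timestamped backup name.
            mv "$fe_home/$d" "$fe_home/$d.bak.$stamp" || return 1
        fi
        mkdir -p "$fe_home/$d" || return 1
    done
}
```

For example, `reset_fe_dirs /mnt/dorisdb/fe` followed by the start command above.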

The error is the same as before.

fe.log contains:

2021-11-16 11:32:26,695 INFO (main|1) [DorisDbFe.start():102] DorisDb FE starting…
2021-11-16 11:32:26,702 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():121] configured prior_cidrs value: 192.168.40.0/24
2021-11-16 11:32:26,715 INFO (main|1) [FrontendOptions.init():89] local address: /192.168.40.98.
2021-11-16 11:32:26,820 INFO (main|1) [ConsistencyChecker.initWorkTime():106] consistency checker will work from 23:00 to 4:00
2021-11-16 11:32:27,094 INFO (main|1) [Catalog.getHelperNodes():1151] get helper nodes: [192.168.40.81:9010]
2021-11-16 11:32:27,134 INFO (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1078] get fe node type FOLLOWER, name 192.168.40.98_9010_1637033531693 from 192.168.40.81:8030
2021-11-16 11:32:27,383 INFO (main|1) [Catalog.getClusterIdAndRole():1022] finished to get cluster id: 720115122, role: FOLLOWER and node name: 192.168.40.98_9010_1637033531693
2021-11-16 11:32:27,408 INFO (main|1) [Catalog.loadImage():1482] start load image from /mnt/dorisdb/fe/doris-meta/image/image.2118992. is ckpt: false
2021-11-16 11:32:27,408 INFO (main|1) [Catalog.loadHeader():1623] finished replay header from image
2021-11-16 11:32:27,411 INFO (main|1) [Catalog.loadMasterInfo():1634] finished replay masterInfo from image
2021-11-16 11:32:27,711 INFO (main|1) [Catalog.loadDb():1677] finished replay databases from image
2021-11-16 11:32:27,726 INFO (main|1) [Catalog.loadLoadJob():1708] finished replay loadJob from image
2021-11-16 11:32:27,749 INFO (main|1) [Catalog.loadAlterJob():1740] finished replay alterJob from image
2021-11-16 11:32:27,753 INFO (main|1) [Catalog.loadRecycleBin():1882] finished replay recycleBin from image
2021-11-16 11:32:27,779 INFO (main|1) [Catalog.loadGlobalVariable():2192] finished replay globalVariable from image
2021-11-16 11:32:27,784 INFO (main|1) [Catalog.loadCluster():6349] finished replay cluster from image
2021-11-16 11:32:27,786 INFO (main|1) [Catalog.loadBrokers():6446] finished replay brokerMgr from image
2021-11-16 11:32:27,789 INFO (main|1) [Catalog.loadResources():1914] finished replay resources from image
2021-11-16 11:32:27,810 INFO (main|1) [Catalog.loadExportJob():1725] finished replay exportJob from image
2021-11-16 11:32:27,815 INFO (main|1) [Catalog.loadBackupHandler():1827] finished replay backupHandler from image

New content in fe.out:


The new FE still fails to start, and no FE process exists.

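Here is what's going on: this error is caused by a version mismatch. Is your current 1 FE + 3 BE cluster on 1.19? My guess is that the new FE was started with 1.18.1 binaries. Please check on your side.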

This cluster was upgraded from 1.17 to 1.19. I'll unpack 1.19 again and retry. Thanks a lot for the help.

You're welcome. This really is a version mismatch. Once you've checked, restart with the matching binaries and it should be fine.

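Thank you very much. It was indeed because I upgraded only the BE during the last test-environment upgrade and forgot to upgrade the FE.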
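A quick way to catch this kind of leftover before adding a node is to compare the installed binaries on each host. The sketch below computes one fingerprint for a directory from the checksums of all files in it; running it against the FE lib directory on every FE host and comparing the results would show whether the hosts carry the same build. The function name dir_fingerprint is hypothetical, and fe/lib as the location of the FE jars is an assumption about the install layout.

```shell
# Print a single fingerprint ("<crc> <size>") for all files under $1.
# File paths are taken relative to $1, so the result does not depend on
# where the directory lives on each host.
# (dir_fingerprint is an illustrative name.)
dir_fingerprint() {
    ( cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort | cksum )
}
```

For example, run `dir_fingerprint /mnt/dorisdb/fe/lib` on both 192.168.40.81 and 192.168.40.98; differing output suggests the two hosts are not running the same FE build.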

You're welcome. Feel free to post future questions on the forum, and you're also welcome to help build the community.