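New FE nodes cannot connect to the master FE node after scaling out the cluster, but the BE nodes work fine

[Details]
Server 1: 172.172.0.10 (master FE node)
Server 2: 172.172.1.10
Server 3: 172.172.2.10

The other two FE nodes are deployed with Docker. The problem occurs whether they are added as followers or as observers: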
mysql> ALTER SYSTEM ADD FOLLOWER "172.172.1.10:9010";
mysql> ALTER SYSTEM ADD FOLLOWER "172.172.2.10:9010";

mysql> ALTER SYSTEM ADD OBSERVER "172.172.1.10:9010";
mysql> ALTER SYSTEM ADD OBSERVER "172.172.2.10:9010";

Errors in fe.warn.log on 172.172.1.10 or 172.172.2.10:
2022-06-21 20:06:28,092 WARN (main|1) [Catalog.getFeNodeTypeAndNameFromHelpers():1130] failed to get fe node type from helper node: 172.172.0.10:9010. response code: 400
2022-06-21 20:06:28,092 WARN (main|1) [Catalog.getClusterIdAndRole():1010] current node is not added to the group. please add it first. sleep 5 seconds and retry, current helper nodes: [172.172.0.10:9010]

[StarRocks version] 2.2.1
[Cluster size] 3 FE (1 follower + 2 observers) + 3 BE (FE and BE co-located)
[Machine info] CPU vCores/memory/NIC, e.g. 48C/64G/10 GbE

Hi, did you specify a helper when starting the FE?

Yes. I modified run_script.sh as follows:

# Start FE. helper

cd /data/deploy/StarRocks-2.2.1/fe/bin/
./start_fe.sh --helper 172.172.0.10:9010 --daemon

Hi, could you please send the complete fe.warn.log and fe.out logs?

starrocks-docker-log.zip (312.5 KB)

MySQL [example_db]> SHOW PROC '/frontends'\G
*************************** 1. row ***************************
Name: 172.172.2.10_9010_1656328690816
IP: 172.172.2.10
EditLogPort: 9010
HttpPort: 8030
QueryPort: 0
RpcPort: 0
Role: FOLLOWER
IsMaster: false
ClusterId: 1265292696
Join: false
Alive: false
ReplayedJournalId: 0
LastHeartbeat: NULL
IsHelper: true
ErrMsg:
StartTime: NULL
Version: NULL
*************************** 2. row ***************************
Name: 172.172.1.10_9010_1656328685942
IP: 172.172.1.10
EditLogPort: 9010
HttpPort: 8030
QueryPort: 0
RpcPort: 0
Role: FOLLOWER
IsMaster: false
ClusterId: 1265292696
Join: false
Alive: false
ReplayedJournalId: 0
LastHeartbeat: NULL
IsHelper: true
ErrMsg: got exception
StartTime: NULL
Version: NULL
*************************** 3. row ***************************
Name: 172.172.0.10_9010_1656328369878
IP: 172.172.0.10
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
IsMaster: true
ClusterId: 1265292696
Join: true
Alive: true
ReplayedJournalId: 109
LastHeartbeat: 2022-06-27 19:18:21
IsHelper: true
ErrMsg:
StartTime: 2022-06-27 19:13:02
Version: 2.2.1-147f178
3 rows in set (0.05 sec)

MySQL [example_db]> SHOW PROC '/backends'\G
*************************** 1. row ***************************
BackendId: 10008
Cluster: default_cluster
IP: 172.172.1.10
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-06-27 19:18:16
LastHeartbeat: 2022-06-27 19:18:41
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 0
DataUsedCapacity: .000
AvailCapacity: 8.990 GB
TotalCapacity: 16.989 GB
UsedPct: 47.08 %
MaxDiskUsedPct: 47.08 %
ErrMsg:
Version: 2.2.1-147f178
Status: {"lastSuccessReportTabletsTime":"2022-06-27 19:18:17"}
DataTotalCapacity: 8.990 GB
DataUsedPct: 0.00 %
CpuCores: 2
*************************** 2. row ***************************
BackendId: 10009
Cluster: default_cluster
IP: 172.172.2.10
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-06-27 19:18:21
LastHeartbeat: 2022-06-27 19:18:41
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 0
DataUsedCapacity: .000
AvailCapacity: 8.961 GB
TotalCapacity: 16.989 GB
UsedPct: 47.25 %
MaxDiskUsedPct: 47.25 %
ErrMsg:
Version: 2.2.1-147f178
Status: {"lastSuccessReportTabletsTime":"2022-06-27 19:18:21"}
DataTotalCapacity: 8.961 GB
DataUsedPct: 0.00 %
CpuCores: 2
*************************** 3. row ***************************
BackendId: 10002
Cluster: default_cluster
IP: 172.172.0.10
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-06-27 19:13:21
LastHeartbeat: 2022-06-27 19:18:41
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 0
DataUsedCapacity: .000
AvailCapacity: 7.165 GB
TotalCapacity: 16.989 GB
UsedPct: 57.82 %
MaxDiskUsedPct: 57.82 %
ErrMsg:
Version: 2.2.1-147f178
Status: {"lastSuccessReportTabletsTime":"2022-06-27 19:18:22"}
DataTotalCapacity: 7.165 GB
DataUsedPct: 0.00 %
CpuCores: 2
3 rows in set (0.01 sec)

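Can the newly added node and the master machine communicate with each other normally?

1. From the master host (inside the container), ping/telnet to the slave machine: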
[root@starrockmaster master]# docker exec -it 8a9b85193f64 /bin/sh
sh-4.2# ping 172.172.1.10
PING 172.172.1.10 (172.172.1.10) 56(84) bytes of data.
64 bytes from 172.172.1.10: icmp_seq=1 ttl=62 time=0.666 ms
^C
--- 172.172.1.10 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.666/0.666/0.666/0.000 ms
sh-4.2# telnet 172.172.1.10 9010
Trying 172.172.1.10...
telnet: connect to address 172.172.1.10: Connection refused
sh-4.2# telnet 172.172.1.10 8030
Trying 172.172.1.10...
telnet: connect to address 172.172.1.10: Connection refused
sh-4.2# telnet 172.172.1.10 8050
Trying 172.172.1.10...
telnet: connect to address 172.172.1.10: Connection refused

2. From the slave machine (inside the container), ping/telnet to the master host:
[root@starrocksslave01 slave]# docker exec -it 9148b3374b50 /bin/sh
sh-4.2# ping 172.172.0.10
PING 172.172.0.10 (172.172.0.10) 56(84) bytes of data.
64 bytes from 172.172.0.10: icmp_seq=1 ttl=62 time=0.592 ms
64 bytes from 172.172.0.10: icmp_seq=2 ttl=62 time=0.715 ms
^C
--- 172.172.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.592/0.653/0.715/0.066 ms
sh-4.2# telnet 172.172.0.10 9010
Trying 172.172.0.10...
Connected to 172.172.0.10.
Escape character is '^]'.
^C^CConnection closed by foreign host.
sh-4.2# telnet 172.172.0.10 8050
Trying 172.172.0.10...
telnet: connect to address 172.172.0.10: Connection refused
sh-4.2# telnet 172.172.0.10 8030
Trying 172.172.0.10...
Connected to 172.172.0.10.
Escape character is '^]'.

URL url = new URL("http://" + helperNode.first + ":" + Config.http_port
        + "/role?host=" + selfNode.first + "&port=" + selfNode.second);
HttpURLConnection conn = null;
conn = (HttpURLConnection) url.openConnection();
if (conn.getResponseCode() != 200) {
    LOG.warn("failed to get fe node type from helper node: {}. response code: {}",
            helperNode, conn.getResponseCode());
    continue;
}

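This code explains the problem.
./bin/start_fe.sh --helper host:port --daemon
The --helper option specifies only the helper's IP and edit_log_port; it does not specify the helper's http_port. StarRocks therefore combines the helper's IP with the http_port from the local process's own fe.conf to reach the helper's HTTP service, and that request fails.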
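
To make this concrete, here is a minimal Java sketch of how that URL ends up being formed. It only mirrors the string concatenation in the snippet quoted above; the class and method names (HelperUrlSketch, roleUrl) and the port 8031 are made up for illustration and are not StarRocks APIs.

```java
class HelperUrlSketch {
    // Mirrors the concatenation in the quoted snippet. Note that the port comes
    // from the NEW node's local configuration, not from the helper.
    static String roleUrl(String helperHost, int localHttpPort,
                          String selfHost, int selfEditLogPort) {
        return "http://" + helperHost + ":" + localHttpPort
                + "/role?host=" + selfHost + "&port=" + selfEditLogPort;
    }

    public static void main(String[] args) {
        // Local http_port is 8030, same as the helper: the URL points at the helper's HTTP service.
        // prints http://172.172.0.10:8030/role?host=172.172.1.10&port=9010
        System.out.println(roleUrl("172.172.0.10", 8030, "172.172.1.10", 9010));

        // Hypothetical local http_port of 8031: the new node now asks the helper
        // on 8031, where the helper is not listening.
        // prints http://172.172.0.10:8031/role?host=172.172.1.10&port=9010
        System.out.println(roleUrl("172.172.0.10", 8031, "172.172.1.10", 9010));
    }
}
```

The helper's own http_port never appears in the call, which is why a mismatch between the two fe.conf files goes unnoticed until the new node starts.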

Now look at another piece of code:

/**
 * Fe http port
 * Currently, all FEs' http port must be same.
 */
@ConfField
public static int http_port = 8030;

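In other words, all FEs are required to use the same HTTP port.

So the fix is simply to set http_port on the FE nodes you are adding to the same value as on the existing FE nodes.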
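
One way to check this before starting a new FE is to compare the http_port values in the two fe.conf files. The sketch below is only an illustration: it assumes fe.conf can be read as a Java properties file and that a missing http_port means the default 8030 shown in the Config snippet above. HttpPortCheck and its methods are hypothetical helper names, not part of StarRocks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class HttpPortCheck {
    // Default taken from the quoted Config snippet (http_port = 8030).
    static final int DEFAULT_HTTP_PORT = 8030;

    // Extracts http_port from the text of an fe.conf file (assumed properties format).
    static int httpPort(String feConfText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(feConfText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("http_port");
        return value == null ? DEFAULT_HTTP_PORT : Integer.parseInt(value.trim());
    }

    // True when the new FE would use the same HTTP port as the existing FE.
    static boolean samePort(String existingFeConf, String newFeConf) {
        return httpPort(existingFeConf) == httpPort(newFeConf);
    }

    public static void main(String[] args) {
        String existing = "http_port = 8030\nedit_log_port = 9010\n";
        String added = "http_port = 8031\nedit_log_port = 9010\n"; // hypothetical mismatch
        System.out.println(samePort(existing, added)); // prints false
    }
}
```

In practice the two strings would be the contents of fe.conf on the existing FE and on the node being added, for example read with Files.readString.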