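Background
This article describes the manual recovery steps for a 1 FE + 1 BE StarRocks cluster that was deployed in the default IP mode and can no longer start because its IP address changed.
The steps described here are only an emergency measure for unusual conditions and are not recommended as routine operations practice. In production, if the FE/BE IP needs to change frequently, upgrade to version 2.5 and use FQDN mode, which replaces IPs with domain names. For how to use FQDN mode, see the StarRocks documentation.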
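Example used in this article
Using StarRocks 2.5.5 as an example: a database and a table are created and data is loaded while the host IP is 172.17.0.8. The FE and BE are then shut down, the IP is changed to 172.17.0.9, and the StarRocks cluster is recovered manually.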
Setting up the 1 FE + 1 BE cluster
0. Check the environment
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
20077: eth0@if20078: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:08 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.8/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
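Because the whole procedure depends on which address the host currently has, it can help to extract that address in a script rather than reading it by eye. The following is a minimal sketch; the function name current_ipv4 is hypothetical, and it assumes the `ip addr` output format shown above (iproute2).

```shell
# Print the first global-scope IPv4 address found in `ip addr` output on stdin.
# current_ipv4 is an illustrative helper name, not part of StarRocks.
current_ipv4() {
    awk '$1 == "inet" && /scope global/ { split($2, a, "/"); print a[1]; exit }'
}

# Usage:
#   ip addr | current_ipv4
```

The loopback address is skipped because its scope is "host", not "global".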
1. Download and extract the installation package
wget https://releases.starrocks.io/starrocks/StarRocks-2.5.5.tar.gz
tar -xzf StarRocks-2.5.5.tar.gz
# FE and BE are both extracted into the StarRocks-2.5.5/ directory
2. Create the FE meta directory and the BE storage directory
cd StarRocks-2.5.5
mkdir -p fe/meta be/storage
3. Start the FE and BE
cd StarRocks-2.5.5
./fe/bin/start_fe.sh --daemon
./be/bin/start_be.sh --daemon
4. Add the BE to the cluster
mysql -h 172.17.0.8 -P 9030 -u root
mysql> ALTER SYSTEM ADD BACKEND '172.17.0.8:9050';
Query OK, 0 rows affected (0.02 sec)
mysql> SHOW FRONTENDS \G
*************************** 1. row ***************************
Name: 172.17.0.8_9010_1684246327820
IP: 172.17.0.8
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: LEADER
ClusterId: 629157983
Join: true
Alive: true
ReplayedJournalId: 14
LastHeartbeat: 2023-05-16 14:12:30
IsHelper: true
ErrMsg:
StartTime: 2023-05-16 14:12:17
Version: 2.5.5-24c1eca
1 row in set (0.02 sec)
mysql> SHOW BACKENDS \G
*************************** 1. row ***************************
BackendId: 10002
IP: 172.17.0.8
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2023-05-16 14:12:30
LastHeartbeat: 2023-05-16 14:12:35
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 0
DataUsedCapacity: 0.000
AvailCapacity: 517.644 GB
TotalCapacity: 1.968 TB
UsedPct: 74.32 %
MaxDiskUsedPct: 74.32 %
ErrMsg:
Version: 2.5.5-24c1eca
Status: {"lastSuccessReportTabletsTime":"2023-05-16 14:12:31"}
DataTotalCapacity: 517.644 GB
DataUsedPct: 0.00 %
CpuCores: 104
NumRunningQueries: 0
MemUsedPct: 0.08 %
CpuUsedPct: 0.0 %
1 row in set (0.01 sec)
5. Create a database and a table, then load data
mysql> CREATE DATABASE test;
Query OK, 0 rows affected (0.01 sec)
mysql> USE test;
Database changed
mysql> CREATE TABLE IF NOT EXISTS sr_member (
sr_id INT,
name STRING,
city_code INT,
reg_date DATE,
verified BOOLEAN
)
PARTITION BY RANGE(reg_date)
(
PARTITION p1 VALUES [('2022-03-13'), ('2022-03-14')),
PARTITION p2 VALUES [('2022-03-14'), ('2022-03-15')),
PARTITION p3 VALUES [('2022-03-15'), ('2022-03-16')),
PARTITION p4 VALUES [('2022-03-16'), ('2022-03-17')),
PARTITION p5 VALUES [('2022-03-17'), ('2022-03-18'))
)
DISTRIBUTED BY HASH(city_code)
PROPERTIES(
"replication_num" = "1"
);
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO sr_member
WITH LABEL insertDemo
VALUES
(001, "tom", 100000, "2022-03-13", true),
(002, "johndoe", 210000, "2022-03-14", false),
(003, "maruko", 200000, "2022-03-14", true),
(004, "ronaldo", 100000, "2022-03-15", false),
(005, "pavlov", 210000, "2022-03-16", false),
(006, "mohammed", 300000, "2022-03-17", true);
Query OK, 6 rows affected (0.43 sec)
{'label':'insertDemo', 'status':'VISIBLE', 'txnId':'2'}
mysql> SELECT * FROM sr_member;
+-------+----------+-----------+------------+----------+
| sr_id | name     | city_code | reg_date   | verified |
+-------+----------+-----------+------------+----------+
|     4 | ronaldo  |    100000 | 2022-03-15 |        0 |
|     2 | johndoe  |    210000 | 2022-03-14 |        0 |
|     3 | maruko   |    200000 | 2022-03-14 |        1 |
|     5 | pavlov   |    210000 | 2022-03-16 |        0 |
|     1 | tom      |    100000 | 2022-03-13 |        1 |
|     6 | mohammed |    300000 | 2022-03-17 |        1 |
+-------+----------+-----------+------------+----------+
6 rows in set (0.10 sec)
6. Stop the FE and BE services
cd StarRocks-2.5.5
./fe/bin/stop_fe.sh
./be/bin/stop_be.sh
Change the IP address from 172.17.0.8 to 172.17.0.9, then confirm the new address
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
20081: eth0@if20082: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:09 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.9/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
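Manually recovering the cluster
At this point fe/bin/start_fe.sh and be/bin/start_be.sh can no longer start the services normally: the FE cannot find the cluster LEADER, and the BE cannot find an available FE node.
1. Back up the data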
cd StarRocks-2.5.5
# back up fe/meta and be/storage in case the data is unexpectedly corrupted
cp -a fe/meta fe/meta.bak
cp -a be/storage be/storage.bak
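Since every later step modifies these directories, it is worth confirming that the backup is complete and that an earlier backup is not silently overwritten. The sketch below wraps the same cp -a call; backup_dir is a hypothetical helper name, and comparing with diff -r is an added assumption about what counts as a good copy.

```shell
# Copy <dir> to <dir>.bak and verify the copy; refuse to overwrite an existing backup.
# backup_dir is an illustrative helper name, not part of StarRocks.
backup_dir() {
    src=${1%/}          # drop a trailing slash so the .bak suffix lands on the directory name
    dst="$src.bak"
    if [ ! -d "$src" ]; then
        echo "missing directory: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "backup already exists: $dst" >&2
        return 1
    fi
    # cp -a preserves permissions and timestamps; diff -r checks the contents match.
    cp -a "$src" "$dst" && diff -r "$src" "$dst" >/dev/null
}

# Usage, from the StarRocks-2.5.5 directory with FE and BE stopped:
#   backup_dir fe/meta && backup_dir be/storage
```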
2. Modify the fe.conf configuration
cd StarRocks-2.5.5
echo "metadata_failure_recovery = true" >> fe/conf/fe.conf
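Appending with echo adds another copy of the line every time the step is repeated. A minimal sketch of an idempotent alternative follows; set_conf is a hypothetical helper name, and it assumes fe.conf uses plain `key = value` lines.

```shell
# Set `key = value` in a properties-style config file:
# rewrite the line if the key is already present, otherwise append it.
# set_conf is an illustrative helper name, not part of StarRocks.
set_conf() {
    file=$1 key=$2 value=$3
    if grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$file"; then
        # -i.orig keeps the previous version of the file as <file>.orig
        sed -i.orig -E "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage, from the StarRocks-2.5.5 directory:
#   set_conf fe/conf/fe.conf metadata_failure_recovery true
```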
3. Delete the ROLE and VERSION files from the FE meta directory
cd StarRocks-2.5.5
rm -f fe/meta/image/ROLE fe/meta/image/VERSION
4. Start the FE
cd StarRocks-2.5.5
./fe/bin/start_fe.sh --daemon
Once the FE has started normally, log in to it with mysql
mysql -h 172.17.0.9 -P 9030 -u root
mysql> SHOW FRONTENDS\G
*************************** 1. row ***************************
Name: 172.17.0.8_9010_1684246327820
IP: 172.17.0.8
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: FOLLOWER
ClusterId: 721097542
Join: false
Alive: false
ReplayedJournalId: 216
LastHeartbeat: 2023-05-16 14:21:20
IsHelper: true
ErrMsg: got exception
StartTime: NULL
Version: 2.5.5-24c1eca
*************************** 2. row ***************************
Name: 172.17.0.9_9010_1684247343434
IP: 172.17.0.9
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: LEADER
ClusterId: 721097542
Join: true
Alive: true
ReplayedJournalId: 235
LastHeartbeat: 2023-05-16 14:29:48
IsHelper: true
ErrMsg:
StartTime: 2023-05-16 14:29:09
Version: 2.5.5-24c1eca
2 rows in set (0.03 sec)
SHOW FRONTENDS now lists two FE nodes. Drop the old FE node:
mysql> ALTER SYSTEM DROP FOLLOWER '172.17.0.8:9010';
Query OK, 0 rows affected (0.01 sec)
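To avoid picking the wrong node by eye, the stale entry can be extracted from the SHOW FRONTENDS\G output. This is a minimal sketch that assumes the field names and order shown above (IP and EditLogPort appear before Alive); dead_frontends is a hypothetical helper name.

```shell
# Read `SHOW FRONTENDS\G` output on stdin and print "IP:EditLogPort"
# for every frontend whose Alive field is not "true".
# dead_frontends is an illustrative helper name, not part of StarRocks.
dead_frontends() {
    awk -F': *' '
        { k = $1; sub(/^[ \t]+/, "", k) }     # field name without leading padding
        k == "IP"          { ip = $2 }
        k == "EditLogPort" { port = $2 }
        k == "Alive" && $2 != "true" { print ip ":" port }
    '
}

# Usage:
#   mysql -h 172.17.0.9 -P 9030 -u root -e 'SHOW FRONTENDS\G' | dead_frontends
```

Each printed value is in the host:port form that ALTER SYSTEM DROP FOLLOWER expects; review the list before dropping anything.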
Start the BE
Delete the cluster_id file under be/storage/, then start the BE process
# delete cluster_id first
rm -f be/storage/cluster_id
# start be
./be/bin/start_be.sh --daemon
Add the BE to the cluster as a new node
mysql -h 172.17.0.9 -P 9030 -u root
mysql> ALTER SYSTEM ADD BACKEND '172.17.0.9:9050';
Query OK, 0 rows affected (0.01 sec)
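Wait a while until the BE reports its status normally. In the SHOW BACKENDS \G output, the row for the new BE looks like this: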
*************************** 2. row ***************************
BackendId: 11001
IP: 172.17.0.9
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2023-05-16 14:33:58
LastHeartbeat: 2023-05-16 14:34:33
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 40
DataUsedCapacity: 24.606 KB
AvailCapacity: 517.564 GB
TotalCapacity: 1.968 TB
UsedPct: 74.32 %
MaxDiskUsedPct: 74.32 %
ErrMsg:
Version: 2.5.5-24c1eca
Status: {"lastSuccessReportTabletsTime":"2023-05-16 14:33:59"}
DataTotalCapacity: 517.564 GB
DataUsedPct: 0.00 %
CpuCores: 104
NumRunningQueries: 0
MemUsedPct: 0.08 %
CpuUsedPct: 0.0 %
2 rows in set (0.00 sec)
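Instead of re-running SHOW BACKENDS by hand, the wait can be scripted. The sketch below only checks the Alive field for one IP, which is an assumption about what "reporting normally" means; TabletNum may be worth checking as well. backend_alive is a hypothetical helper name.

```shell
# Read `SHOW BACKENDS\G` output on stdin; succeed only if the backend
# with the given IP reports Alive: true.
# backend_alive is an illustrative helper name, not part of StarRocks.
backend_alive() {
    awk -F': *' -v want="$1" '
        { k = $1; sub(/^[ \t]+/, "", k) }     # field name without leading padding
        k == "IP"    { ip = $2 }
        k == "Alive" && ip == want && $2 == "true" { found = 1 }
        END { exit !found }
    '
}

# Usage: poll every 5 seconds until the new BE is alive.
#   until mysql -h 172.17.0.9 -P 9030 -u root -e 'SHOW BACKENDS\G' | backend_alive 172.17.0.9; do
#       sleep 5
#   done
```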
Drop the BE registered under the old IP
mysql> ALTER SYSTEM DROP BACKEND '172.17.0.8:9050';
Query OK, 0 rows affected (0.01 sec)
Note: if the DROP returns the following error, the new BE is not yet working normally and the old BE cannot be dropped yet.
ERROR 1064 (HY000): Unexpected exception: Tables such as [xxxxxx] on the backend[172.17.0.8:9050] have only one replica.
Modify the FE's fe.conf
Stop the FE service, remove the metadata_failure_recovery = true setting, then start the FE service again.
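Removing the setting can also be scripted so that no stray copy of the line is left behind. A minimal sketch, assuming fe.conf uses plain `key = value` lines; unset_conf is a hypothetical helper name.

```shell
# Delete every `key = ...` line for the given key from a config file,
# keeping the previous version of the file as <file>.orig.
# unset_conf is an illustrative helper name, not part of StarRocks.
unset_conf() {
    file=$1 key=$2
    sed -i.orig -E "/^[[:space:]]*$key[[:space:]]*=/d" "$file"
}

# Usage, from the StarRocks-2.5.5 directory:
#   ./fe/bin/stop_fe.sh
#   unset_conf fe/conf/fe.conf metadata_failure_recovery
#   ./fe/bin/start_fe.sh --daemon
```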
Verify the recovered cluster and data
mysql> SHOW FRONTENDS \G
*************************** 1. row ***************************
Name: 172.17.0.9_9010_1684247343434
IP: 172.17.0.9
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: LEADER
ClusterId: 721097542
Join: true
Alive: true
ReplayedJournalId: 484
LastHeartbeat: 2023-05-16 14:39:06
IsHelper: true
ErrMsg:
StartTime: 2023-05-16 14:38:53
Version: 2.5.5-24c1eca
1 row in set (0.04 sec)
mysql> SHOW BACKENDS \G
*************************** 1. row ***************************
BackendId: 11001
IP: 172.17.0.9
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2023-05-16 14:33:58
LastHeartbeat: 2023-05-16 14:39:11
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 40
DataUsedCapacity: 24.606 KB
AvailCapacity: 517.561 GB
TotalCapacity: 1.968 TB
UsedPct: 74.32 %
MaxDiskUsedPct: 74.32 %
ErrMsg:
Version: 2.5.5-24c1eca
Status: {"lastSuccessReportTabletsTime":"2023-05-16 14:38:59"}
DataTotalCapacity: 517.561 GB
DataUsedPct: 0.00 %
CpuCores: 104
NumRunningQueries: 0
MemUsedPct: 0.08 %
CpuUsedPct: 0.0 %
1 row in set (0.00 sec)
mysql> USE test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT * FROM sr_member;
+-------+----------+-----------+------------+----------+
| sr_id | name     | city_code | reg_date   | verified |
+-------+----------+-----------+------------+----------+
|     2 | johndoe  |    210000 | 2022-03-14 |        0 |
|     3 | maruko   |    200000 | 2022-03-14 |        1 |
|     4 | ronaldo  |    100000 | 2022-03-15 |        0 |
|     5 | pavlov   |    210000 | 2022-03-16 |        0 |
|     6 | mohammed |    300000 | 2022-03-17 |        1 |
|     1 | tom      |    100000 | 2022-03-13 |        1 |
+-------+----------+-----------+------------+----------+
6 rows in set (0.11 sec)
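The rows come back in a different order than before the IP change, which is expected for a query without ORDER BY. To compare the two result sets mechanically, sort them first. A minimal sketch follows; same_rows and the file names before.tsv and after.tsv are hypothetical, and it assumes each result was saved with the mysql client's batch options.

```shell
# Succeed if two saved result files contain the same rows, ignoring row order.
# same_rows is an illustrative helper name, not part of StarRocks.
same_rows() {
    a=$(sort "$1") || return 1
    b=$(sort "$2") || return 1
    [ "$a" = "$b" ]
}

# Usage: save the rows before the IP change and again after recovery
# (-N omits column names, -B prints tab-separated rows), then compare.
#   mysql -h 172.17.0.8 -P 9030 -u root -N -B -e 'SELECT * FROM test.sr_member' > before.tsv
#   mysql -h 172.17.0.9 -P 9030 -u root -N -B -e 'SELECT * FROM test.sr_member' > after.tsv
#   same_rows before.tsv after.tsv && echo "data matches"
```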