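【Details】Data on some FE nodes is abnormal, but SHOW PROC does not report any problem
【Background】I found that the audit logs from some nodes were not being synced into the StarRocks table, so I first ran uninstall plugin AuditLoader; and then ran install plugin AuditLoader; again. After the reinstall, show plugins still reports the plugin as being in the uninstalling state.
【Business impact】Queries against some tables fail
【Shared-data mode (storage-compute separation)】No
【StarRocks version】2.5.5
【Cluster size】6 FE (3 followers + 3 observers) + 6 BE (FE and BE co-located on the same machines)
【Machine specs】16C/64G/Gigabit Ethernet
【Contact】lirulei90#126.com
【Attachments】
According to SHOW PROC, all FEs and BEs look healthy: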
[2024/10/11 15:22:35] mysql> SHOW PROC '/frontends' ;
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] | Name | IP | EditLogPort | HttpPort | QueryPort | RpcPort | Role | ClusterId | Join | Alive | ReplayedJournalId | LastHeartbeat | IsHelper | ErrMsg | StartTime | Version |
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] | 172.19.135.215_9010_1665545759176 | 172.19.135.215 | 9010 | 8039 | 9030 | 9020 | FOLLOWER | 889910524 | true | true | 95776818 | 2024-10-11 11:54:52 | true | | 2024-09-02 15:42:11 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.216_9010_1665544743989 | 172.19.135.216 | 9010 | 8039 | 9030 | 9020 | LEADER | 889910524 | true | true | 95776818 | 2024-10-11 11:54:52 | true | | 2023-05-17 15:28:49 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.214_9010_1665544416689 | 172.19.135.214 | 9010 | 8039 | 9030 | 9020 | FOLLOWER | 889910524 | true | true | 95776818 | 2024-10-11 11:54:52 | true | | 2023-05-17 15:23:02 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.213_9010_1665544749947 | 172.19.135.213 | 9010 | 8039 | 9030 | 9020 | OBSERVER | 889910524 | true | true | 95814752 | 2024-10-11 11:54:52 | true | | 2023-05-17 15:22:11 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.217_9010_1665544749954 | 172.19.135.217 | 9010 | 8039 | 9030 | 9020 | OBSERVER | 889910524 | true | true | 95776818 | 2024-10-11 11:54:52 | false | | 2023-05-17 15:25:50 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.221_9010_1665544749960 | 172.19.135.221 | 9010 | 8039 | 9030 | 9020 | OBSERVER | 889910524 | true | true | 95776818 | 2024-10-11 11:54:52 | false | | 2023-05-17 15:26:46 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] 6 rows in set (0.02 sec)
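One detail in the output above that SHOW PROC does not flag: 172.19.135.213 reports ReplayedJournalId 95814752, while the other five FEs report 95776818. The Python sketch below shows one way to spot that kind of divergence from pasted rows. The function names `parse_frontends` and `journal_outliers` are made up for illustration, and the sample is trimmed to three of the columns. A differing journal ID is only a hint, not a confirmed cause: 172.19.135.217 also failed queries, yet its journal ID matches the leader's.

```python
# Trimmed copy of the SHOW PROC '/frontends' rows above (three columns only).
SAMPLE = """
| IP | Role | ReplayedJournalId |
| 172.19.135.215 | FOLLOWER | 95776818 |
| 172.19.135.216 | LEADER | 95776818 |
| 172.19.135.214 | FOLLOWER | 95776818 |
| 172.19.135.213 | OBSERVER | 95814752 |
| 172.19.135.217 | OBSERVER | 95776818 |
| 172.19.135.221 | OBSERVER | 95776818 |
"""


def parse_frontends(text):
    """Turn pipe-delimited table rows into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips +---+ borders, prompts and the row-count line
        # Drop whatever precedes the first '|' (e.g. a timestamp prefix)
        # and the empty piece after the trailing '|'.
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def journal_outliers(rows):
    """Return IPs whose ReplayedJournalId differs from the LEADER's."""
    leader = next(r for r in rows if r["Role"] == "LEADER")
    target = leader["ReplayedJournalId"]
    return [r["IP"] for r in rows if r["ReplayedJournalId"] != target]


print(journal_outliers(parse_frontends(SAMPLE)))  # ['172.19.135.213']
```

The parser also accepts the full 16-column rows with the timestamp prefix, since it only relies on the `|` separators and the header names.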
[2024/10/11 15:47:57] mysql> show backends;
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] | BackendId | IP | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime | LastHeartbeat | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version | Status | DataTotalCapacity | DataUsedPct | CpuCores | NumRunningQueries | MemUsedPct | CpuUsedPct |
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] | 10003 | 172.19.135.213 | 9050 | 9060 | 8049 | 8060 | 2024-10-11 13:30:05 | 2024-10-11 15:47:55 | true | false | false | 10379 | 462.151 GB | 339.370 GB | 1007.802 GB | 66.33 % | 66.33 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:10"} | 801.521 GB | 57.66 % | 16 | 7 | 41.21 % | 0.0 % |
[2024/10/11 15:47:57] | 10004 | 172.19.135.214 | 9050 | 9060 | 8049 | 8060 | 2024-10-11 13:30:05 | 2024-10-11 15:47:55 | true | false | false | 10428 | 441.538 GB | 339.013 GB | 1007.802 GB | 66.36 % | 66.36 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:11"} | 780.551 GB | 56.57 % | 16 | 7 | 41.50 % | 0.1 % |
[2024/10/11 15:47:57] | 10005 | 172.19.135.215 | 9050 | 9060 | 8049 | 8060 | 2024-09-02 15:42:06 | 2024-10-11 15:47:55 | true | false | false | 10162 | 500.253 GB | 318.578 GB | 1007.802 GB | 68.39 % | 68.39 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:17"} | 818.832 GB | 61.09 % | 16 | 7 | 47.78 % | 0.1 % |
[2024/10/11 15:47:57] | 10006 | 172.19.135.216 | 9050 | 9060 | 8049 | 8060 | 2024-09-02 14:36:20 | 2024-10-11 15:47:55 | true | false | false | 10600 | 515.882 GB | 384.140 GB | 1007.802 GB | 61.88 % | 61.88 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:29"} | 900.022 GB | 57.32 % | 16 | 7 | 49.81 % | 0.1 % |
[2024/10/11 15:47:57] | 10007 | 172.19.135.217 | 9050 | 9060 | 8049 | 8060 | 2024-09-13 10:05:08 | 2024-10-11 15:47:55 | true | false | false | 10203 | 511.056 GB | 390.438 GB | 1007.802 GB | 61.26 % | 61.26 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:00"} | 901.494 GB | 56.69 % | 16 | 7 | 41.67 % | 0.1 % |
[2024/10/11 15:47:57] | 10008 | 172.19.135.221 | 9050 | 9060 | 8049 | 8060 | 2024-09-02 14:45:08 | 2024-10-11 15:47:55 | true | false | false | 10613 | 504.585 GB | 419.876 GB | 1007.801 GB | 58.34 % | 58.34 % | | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:09"} | 924.461 GB | 54.58 % | 16 | 7 | 42.21 % | 0.1 % |
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] 6 rows in set (0.01 sec)
mysql>
mysql> show tablet 16814941 ;
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
| DbName | TableName | PartitionName | IndexName | DbId | TableId | PartitionId | IndexId | IsSync | DetailCmd |
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
| dwd | dwd_tb1 | p202012 | dwd_tb1 | 10058 | 15543645 | 16210757 | 16814052 | true | SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941'; |
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> show tablet 16814941 \G
*************************** 1. row ***************************
DbName: dwd
TableName: dwd_tb1
PartitionName: p202012
IndexName: dwd_tb1
DbId: 10058
TableId: 15543645
PartitionId: 16210757
IndexId: 16814052
IsSync: true
DetailCmd: SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941';
1 row in set (0.00 sec)
mysql> SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941';
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
| ReplicaId | BackendId | Version | VersionHash | LstSuccessVersion | LstSuccessVersionHash | LstFailedVersion | LstFailedVersionHash | LstFailedTime | SchemaHash | DataSize | RowCount | State | IsBad | IsSetBadForce | VersionCount | PathHash | MetaUrl | CompactionStatus | IsErrorState |
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
| 16814942 | 10003 | 2 | 0 | 2 | 0 | -1 | 0 | NULL | -1 | 5087790 | 21074 | NORMAL | false | false | 1 | -1 | http://172.19.135.213:8049/api/meta/header/16814941 | http://172.19.135.213:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false |
| 16814943 | 10008 | 2 | 0 | 2 | 0 | -1 | 0 | NULL | -1 | 5055500 | 21074 | NORMAL | false | false | 1 | -1 | http://172.19.135.221:8049/api/meta/header/16814941 | http://172.19.135.221:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false |
| 16814944 | 10004 | 2 | 0 | 2 | 0 | -1 | 0 | NULL | -1 | 5055500 | 21074 | NORMAL | false | false | 1 | -1 | http://172.19.135.214:8049/api/meta/header/16814941 | http://172.19.135.214:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false |
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
3 rows in set (0.00 sec)
mysql> SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941' \G
*************************** 1. row ***************************
ReplicaId: 16814942
BackendId: 10003
Version: 2
VersionHash: 0
LstSuccessVersion: 2
LstSuccessVersionHash: 0
LstFailedVersion: -1
LstFailedVersionHash: 0
LstFailedTime: NULL
SchemaHash: -1
DataSize: 5087790
RowCount: 21074
State: NORMAL
IsBad: false
IsSetBadForce: false
VersionCount: 1
PathHash: -1
MetaUrl: http://172.19.135.213:8049/api/meta/header/16814941
CompactionStatus: http://172.19.135.213:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
IsErrorState: false
*************************** 2. row ***************************
ReplicaId: 16814943
BackendId: 10008
Version: 2
VersionHash: 0
LstSuccessVersion: 2
LstSuccessVersionHash: 0
LstFailedVersion: -1
LstFailedVersionHash: 0
LstFailedTime: NULL
SchemaHash: -1
DataSize: 5055500
RowCount: 21074
State: NORMAL
IsBad: false
IsSetBadForce: false
VersionCount: 1
PathHash: -1
MetaUrl: http://172.19.135.221:8049/api/meta/header/16814941
CompactionStatus: http://172.19.135.221:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
IsErrorState: false
*************************** 3. row ***************************
ReplicaId: 16814944
BackendId: 10004
Version: 2
VersionHash: 0
LstSuccessVersion: 2
LstSuccessVersionHash: 0
LstFailedVersion: -1
LstFailedVersionHash: 0
LstFailedTime: NULL
SchemaHash: -1
DataSize: 5055500
RowCount: 21074
State: NORMAL
IsBad: false
IsSetBadForce: false
VersionCount: 1
PathHash: -1
MetaUrl: http://172.19.135.214:8049/api/meta/header/16814941
CompactionStatus: http://172.19.135.214:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
IsErrorState: false
3 rows in set (0.00 sec)
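However, after repeatedly connecting to different FE nodes and running the same SELECT, I found that two FEs, 172.19.135.213 and 172.19.135.217, are faulty, while all the other FEs work normally. The error looks like this: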
[2024/10/11 15:51:02] mysql> select * from dwd_tb1 limit 1 \G
[2024/10/11 15:51:02] ERROR 1064 (HY000): failed to get tablet. tablet_id=16814933, with schema_hash=252439453, reason=tablet does not exist backend:172.19.135.214
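The per-FE check described above was done by hand. A minimal Python sketch of automating it follows. `probe_frontends` and `run_query` are hypothetical names: `run_query(host, sql)` stands for whatever MySQL client call is available, and the `fake_run_query` stand-in only mimics the outcome reported in this post so that the sketch runs without a cluster.

```python
def probe_frontends(hosts, run_query, sql="select * from dwd_tb1 limit 1"):
    """Run the same statement through every FE.

    Returns {host: None} on success or {host: error text} on failure.
    """
    results = {}
    for host in hosts:
        try:
            run_query(host, sql)
            results[host] = None
        except Exception as exc:  # keep probing the remaining FEs
            results[host] = str(exc)
    return results


FE_HOSTS = [
    "172.19.135.213", "172.19.135.214", "172.19.135.215",
    "172.19.135.216", "172.19.135.217", "172.19.135.221",
]


def fake_run_query(host, sql):
    """Stand-in for a real client; reproduces the behaviour seen in this post."""
    if host in ("172.19.135.213", "172.19.135.217"):
        raise RuntimeError("ERROR 1064 (HY000): failed to get tablet.")
    return [("row",)]


status = probe_frontends(FE_HOSTS, fake_run_query)
bad = [host for host, err in status.items() if err is not None]
print(bad)  # ['172.19.135.213', '172.19.135.217']
```

With a real client, `run_query` would connect to each FE's QueryPort (9030 in the SHOW PROC output above) and execute the statement there, so that each FE plans the query from its own copy of the metadata.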
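Temporary workaround: I removed these two FEs from the cluster, cleared their data, and added them back to the FE cluster. After that, everything returned to normal.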
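The exact commands are not listed here. As a rough sketch, the sequence for one FE might look like the plan generated below. It assumes the standard `ALTER SYSTEM DROP/ADD OBSERVER` statements and the `--helper` option of `bin/start_fe.sh`, takes the leader 172.19.135.216 and EditLogPort 9010 from the SHOW PROC output, and leaves the metadata directory as whatever `meta_dir` is set to in fe.conf. The helper name `rejoin_plan` is hypothetical, and the order of the last two steps may need adjusting for a given deployment.

```python
def rejoin_plan(fe_host, role, helper_host, edit_log_port=9010):
    """Build the ordered steps for dropping and re-adding one FE.

    Lines starting with '#' are manual steps on the FE machine;
    the others are SQL to run on a healthy FE.
    """
    if role not in ("OBSERVER", "FOLLOWER"):
        raise ValueError("role must be OBSERVER or FOLLOWER")
    addr = f"{fe_host}:{edit_log_port}"
    return [
        f'ALTER SYSTEM DROP {role} "{addr}";',
        f"# on {fe_host}: stop the FE, then empty the directory set as meta_dir in fe.conf",
        f'ALTER SYSTEM ADD {role} "{addr}";',
        f'# on {fe_host}: bin/start_fe.sh --helper "{helper_host}:{edit_log_port}" --daemon',
    ]


# Both faulty FEs in this post are observers.
for host in ("172.19.135.213", "172.19.135.217"):
    for step in rejoin_plan(host, "OBSERVER", "172.19.135.216"):
        print(step)
```

Backing up the metadata directory before emptying it keeps a way back if the rejoin does not go as expected.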