FE data is abnormal within the cluster, but SHOW PROC looks normal

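【Details】Data on some FE nodes is abnormal, but SHOW PROC reports no problems.
【Background】I noticed that audit logs from some nodes were not being synced into the StarRocks table, so I ran UNINSTALL PLUGIN AuditLoader and then INSTALL PLUGIN AuditLoader again. After reinstalling the plugin, SHOW PLUGINS still reports its status as uninstalling.
【Business impact】Queries on some tables fail.
【Shared-data (storage-compute separation)】No
【StarRocks version】2.5.5
【Cluster size】6 FE (3 follower + 3 observer) + 6 BE (FE and BE co-located)
【Machine info】16C/64G/Gigabit Ethernet
【Contact】lirulei90#126.com
【Attachments】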

Judging from SHOW PROC, both the FEs and the BEs are normal:
[2024/10/11 15:22:35] mysql> SHOW PROC '/frontends' ;
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] | Name                              | IP             | EditLogPort | HttpPort | QueryPort | RpcPort | Role     | ClusterId | Join | Alive | ReplayedJournalId | LastHeartbeat       | IsHelper | ErrMsg | StartTime           | Version       |
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] | 172.19.135.215_9010_1665545759176 | 172.19.135.215 | 9010        | 8039     | 9030      | 9020    | FOLLOWER | 889910524 | true | true  | 95776818          | 2024-10-11 11:54:52 | true     |        | 2024-09-02 15:42:11 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.216_9010_1665544743989 | 172.19.135.216 | 9010        | 8039     | 9030      | 9020    | LEADER   | 889910524 | true | true  | 95776818          | 2024-10-11 11:54:52 | true     |        | 2023-05-17 15:28:49 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.214_9010_1665544416689 | 172.19.135.214 | 9010        | 8039     | 9030      | 9020    | FOLLOWER | 889910524 | true | true  | 95776818          | 2024-10-11 11:54:52 | true     |        | 2023-05-17 15:23:02 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.213_9010_1665544749947 | 172.19.135.213 | 9010        | 8039     | 9030      | 9020    | OBSERVER | 889910524 | true | true  | 95814752          | 2024-10-11 11:54:52 | true     |        | 2023-05-17 15:22:11 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.217_9010_1665544749954 | 172.19.135.217 | 9010        | 8039     | 9030      | 9020    | OBSERVER | 889910524 | true | true  | 95776818          | 2024-10-11 11:54:52 | false    |        | 2023-05-17 15:25:50 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] | 172.19.135.221_9010_1665544749960 | 172.19.135.221 | 9010        | 8039     | 9030      | 9020    | OBSERVER | 889910524 | true | true  | 95776818          | 2024-10-11 11:54:52 | false    |        | 2023-05-17 15:26:46 | 2.5.5-24c1eca |
[2024/10/11 15:22:35] +-----------------------------------+----------------+-------------+----------+-----------+---------+----------+-----------+------+-------+-------------------+---------------------+----------+--------+---------------------+---------------+
[2024/10/11 15:22:35] 6 rows in set (0.02 sec)


[2024/10/11 15:47:57] mysql> show backends;                                                                                                                                                                             
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] | BackendId | IP             | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime       | LastHeartbeat       | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version       | Status                                                 | DataTotalCapacity | DataUsedPct | CpuCores | NumRunningQueries | MemUsedPct | CpuUsedPct |
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] | 10003     | 172.19.135.213 | 9050          | 9060   | 8049     | 8060     | 2024-10-11 13:30:05 | 2024-10-11 15:47:55 | true  | false                | false                 | 10379     | 462.151 GB       | 339.370 GB    | 1007.802 GB   | 66.33 % | 66.33 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:10"} | 801.521 GB        | 57.66 %     | 16       | 7                 | 41.21 %    | 0.0 %      |
[2024/10/11 15:47:57] | 10004     | 172.19.135.214 | 9050          | 9060   | 8049     | 8060     | 2024-10-11 13:30:05 | 2024-10-11 15:47:55 | true  | false                | false                 | 10428     | 441.538 GB       | 339.013 GB    | 1007.802 GB   | 66.36 % | 66.36 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:11"} | 780.551 GB        | 56.57 %     | 16       | 7                 | 41.50 %    | 0.1 %      |
[2024/10/11 15:47:57] | 10005     | 172.19.135.215 | 9050          | 9060   | 8049     | 8060     | 2024-09-02 15:42:06 | 2024-10-11 15:47:55 | true  | false                | false                 | 10162     | 500.253 GB       | 318.578 GB    | 1007.802 GB   | 68.39 % | 68.39 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:17"} | 818.832 GB        | 61.09 %     | 16       | 7                 | 47.78 %    | 0.1 %      |
[2024/10/11 15:47:57] | 10006     | 172.19.135.216 | 9050          | 9060   | 8049     | 8060     | 2024-09-02 14:36:20 | 2024-10-11 15:47:55 | true  | false                | false                 | 10600     | 515.882 GB       | 384.140 GB    | 1007.802 GB   | 61.88 % | 61.88 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:29"} | 900.022 GB        | 57.32 %     | 16       | 7                 | 49.81 %    | 0.1 %      |
[2024/10/11 15:47:57] | 10007     | 172.19.135.217 | 9050          | 9060   | 8049     | 8060     | 2024-09-13 10:05:08 | 2024-10-11 15:47:55 | true  | false                | false                 | 10203     | 511.056 GB       | 390.438 GB    | 1007.802 GB   | 61.26 % | 61.26 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:00"} | 901.494 GB        | 56.69 %     | 16       | 7                 | 41.67 %    | 0.1 %      |
[2024/10/11 15:47:57] | 10008     | 172.19.135.221 | 9050          | 9060   | 8049     | 8060     | 2024-09-02 14:45:08 | 2024-10-11 15:47:55 | true  | false                | false                 | 10613     | 504.585 GB       | 419.876 GB    | 1007.801 GB   | 58.34 % | 58.34 %        |        | 2.5.5-24c1eca | {"lastSuccessReportTabletsTime":"2024-10-11 15:47:09"} | 924.461 GB        | 54.58 %     | 16       | 7                 | 42.21 %    | 0.1 %      |
[2024/10/11 15:47:57] +-----------+----------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
[2024/10/11 15:47:57] 6 rows in set (0.01 sec)


mysql> 
mysql> show tablet 16814941 ;
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
| DbName | TableName                  | PartitionName | IndexName                  | DbId  | TableId  | PartitionId | IndexId  | IsSync | DetailCmd                                                              |
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
| dwd    | dwd_tb1 | p202012       | dwd_tb1 | 10058 | 15543645 | 16210757    | 16814052 | true   | SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941'; |
+--------+----------------------------+---------------+----------------------------+-------+----------+-------------+----------+--------+------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show tablet 16814941 \G
*************************** 1. row ***************************
       DbName: dwd
    TableName: dwd_tb1
PartitionName: p202012
    IndexName: dwd_tb1
         DbId: 10058
      TableId: 15543645
  PartitionId: 16210757
      IndexId: 16814052
       IsSync: true
    DetailCmd: SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941';
1 row in set (0.00 sec)

mysql> SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941';
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
| ReplicaId | BackendId | Version | VersionHash | LstSuccessVersion | LstSuccessVersionHash | LstFailedVersion | LstFailedVersionHash | LstFailedTime | SchemaHash | DataSize | RowCount | State  | IsBad | IsSetBadForce | VersionCount | PathHash | MetaUrl                                             | CompactionStatus                                                                 | IsErrorState |
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
| 16814942  | 10003     | 2       | 0           | 2                 | 0                     | -1               | 0                    | NULL          | -1         | 5087790  | 21074    | NORMAL | false | false         | 1            | -1       | http://172.19.135.213:8049/api/meta/header/16814941 | http://172.19.135.213:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false        |
| 16814943  | 10008     | 2       | 0           | 2                 | 0                     | -1               | 0                    | NULL          | -1         | 5055500  | 21074    | NORMAL | false | false         | 1            | -1       | http://172.19.135.221:8049/api/meta/header/16814941 | http://172.19.135.221:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false        |
| 16814944  | 10004     | 2       | 0           | 2                 | 0                     | -1               | 0                    | NULL          | -1         | 5055500  | 21074    | NORMAL | false | false         | 1            | -1       | http://172.19.135.214:8049/api/meta/header/16814941 | http://172.19.135.214:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1 | false        |
+-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+---------------+--------------+----------+-----------------------------------------------------+----------------------------------------------------------------------------------+--------------+
3 rows in set (0.00 sec)

mysql> SHOW PROC '/dbs/10058/15543645/partitions/16210757/16814052/16814941' \G
*************************** 1. row ***************************
            ReplicaId: 16814942
            BackendId: 10003
              Version: 2
          VersionHash: 0
    LstSuccessVersion: 2
LstSuccessVersionHash: 0
     LstFailedVersion: -1
 LstFailedVersionHash: 0
        LstFailedTime: NULL
           SchemaHash: -1
             DataSize: 5087790
             RowCount: 21074
                State: NORMAL
                IsBad: false
        IsSetBadForce: false
         VersionCount: 1
             PathHash: -1
              MetaUrl: http://172.19.135.213:8049/api/meta/header/16814941
     CompactionStatus: http://172.19.135.213:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
         IsErrorState: false
*************************** 2. row ***************************
            ReplicaId: 16814943
            BackendId: 10008
              Version: 2
          VersionHash: 0
    LstSuccessVersion: 2
LstSuccessVersionHash: 0
     LstFailedVersion: -1
 LstFailedVersionHash: 0
        LstFailedTime: NULL
           SchemaHash: -1
             DataSize: 5055500
             RowCount: 21074
                State: NORMAL
                IsBad: false
        IsSetBadForce: false
         VersionCount: 1
             PathHash: -1
              MetaUrl: http://172.19.135.221:8049/api/meta/header/16814941
     CompactionStatus: http://172.19.135.221:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
         IsErrorState: false
*************************** 3. row ***************************
            ReplicaId: 16814944
            BackendId: 10004
              Version: 2
          VersionHash: 0
    LstSuccessVersion: 2
LstSuccessVersionHash: 0
     LstFailedVersion: -1
 LstFailedVersionHash: 0
        LstFailedTime: NULL
           SchemaHash: -1
             DataSize: 5055500
             RowCount: 21074
                State: NORMAL
                IsBad: false
        IsSetBadForce: false
         VersionCount: 1
             PathHash: -1
              MetaUrl: http://172.19.135.214:8049/api/meta/header/16814941
     CompactionStatus: http://172.19.135.214:8049/api/compaction/show?tablet_id=16814941&schema_hash=-1
         IsErrorState: false
3 rows in set (0.00 sec)

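However, after repeatedly connecting to different FE nodes and running the same SELECT, I found that two FEs, 172.19.135.213 and 172.19.135.217, are faulty, while all the other FEs work normally. The error looks like this: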

[2024/10/11 15:51:02] mysql> select * from  dwd_tb1 limit 1 \G
[2024/10/11 15:51:02] ERROR 1064 (HY000): failed to get tablet. tablet_id=16814933, with schema_hash=252439453, reason=tablet does not exist backend:172.19.135.214
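
The per-FE check described above can be scripted. The following is only a minimal sketch: `probe_frontends` and the `run_query` callable are hypothetical names, not part of StarRocks, and `run_query(host, sql)` is assumed to be supplied by the caller (for example, a thin wrapper around whatever MySQL client library is available, connecting to the FE query port 9030).

```python
from typing import Callable, Dict, Iterable


def probe_frontends(
    fe_hosts: Iterable[str],
    run_query: Callable[[str, str], object],
    sql: str = "select * from dwd_tb1 limit 1",
) -> Dict[str, str]:
    """Run the same statement through every FE and record 'ok' or the error text.

    run_query is a caller-supplied (hypothetical) function that connects to
    the given FE host, executes sql, and raises an exception on failure.
    """
    results: Dict[str, str] = {}
    for host in fe_hosts:
        try:
            run_query(host, sql)
            results[host] = "ok"
        except Exception as exc:
            # Keep the message so the failing FEs can be compared afterwards.
            results[host] = f"error: {exc}"
    return results
```

Any host whose entry is not `"ok"` is a candidate for the kind of FE-side metadata problem reported here.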

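Temporary workaround: remove these two FEs from the cluster, clear their data, and add them back to the FE cluster. After that, everything returned to normal.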
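
The post does not show the exact statements used for this workaround. As a hedged sketch, the helper below (the name `rejoin_statements` is hypothetical) builds the `ALTER SYSTEM DROP ...` / `ALTER SYSTEM ADD ...` statements for one FE, using the role and EditLogPort shown in the `SHOW PROC '/frontends'` output. The step between the two statements (stopping the FE and clearing its data) is my reading of "clear the data" and is only indicated in a comment.

```python
from typing import List


def rejoin_statements(host: str, edit_log_port: int, role: str) -> List[str]:
    """Build the SQL to drop an FE from the cluster and add it back.

    Assumption: the workaround maps to ALTER SYSTEM DROP/ADD with the FE's
    role; the original post does not list the statements it actually ran.
    """
    role = role.upper()
    if role not in ("FOLLOWER", "OBSERVER"):
        raise ValueError(f"unsupported FE role: {role}")
    target = f"{host}:{edit_log_port}"
    return [
        f'ALTER SYSTEM DROP {role} "{target}";',
        # Between these two statements: stop the FE process and clear its
        # data on that host (assumed to be a manual step).
        f'ALTER SYSTEM ADD {role} "{target}";',
    ]


# The two faulty FEs are both OBSERVERs listening on EditLogPort 9010.
for fe_host in ("172.19.135.213", "172.19.135.217"):
    for stmt in rejoin_statements(fe_host, 9010, "observer"):
        print(stmt)
```

The statements are meant to be run on a healthy FE (for example the Leader), not on the node being removed.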

The status of these three FEs (172.19.135.213, 172.19.135.217, and 172.19.135.221) doesn't look quite right; compare their ReplayedJournalId and IsHelper values.
Version 2.5.5 is too old; consider upgrading.
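
The ReplayedJournalId comparison suggested in this reply can be done mechanically. Below is a minimal sketch, with the rows copied from the `SHOW PROC '/frontends'` output above; `journal_outliers` and its `tolerance` parameter are hypothetical names introduced for illustration. On a busy cluster, non-leader FEs may briefly lag the Leader, so a non-zero tolerance and several samples over time would be more reliable than a single snapshot.

```python
from typing import Dict, List


def journal_outliers(frontends: List[Dict[str, str]], tolerance: int = 0) -> List[str]:
    """Return IPs of FEs whose ReplayedJournalId differs from the Leader's
    by more than `tolerance` journal entries."""
    leader = next(fe for fe in frontends if fe["Role"] == "LEADER")
    leader_id = int(leader["ReplayedJournalId"])
    return [
        fe["IP"]
        for fe in frontends
        if abs(int(fe["ReplayedJournalId"]) - leader_id) > tolerance
    ]


# Values taken from the SHOW PROC '/frontends' output in this thread.
rows = [
    {"IP": "172.19.135.215", "Role": "FOLLOWER", "ReplayedJournalId": "95776818"},
    {"IP": "172.19.135.216", "Role": "LEADER", "ReplayedJournalId": "95776818"},
    {"IP": "172.19.135.214", "Role": "FOLLOWER", "ReplayedJournalId": "95776818"},
    {"IP": "172.19.135.213", "Role": "OBSERVER", "ReplayedJournalId": "95814752"},
    {"IP": "172.19.135.217", "Role": "OBSERVER", "ReplayedJournalId": "95776818"},
    {"IP": "172.19.135.221", "Role": "OBSERVER", "ReplayedJournalId": "95776818"},
]

outliers = journal_outliers(rows)
print(outliers)  # ['172.19.135.213']
```

In this snapshot only 172.19.135.213 differs from the Leader, and its ID is ahead of the Leader's rather than behind it. The check only covers ReplayedJournalId; IsHelper still has to be compared by eye.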