Automatic replica balancing

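To help locate your problem faster, please provide the following information. Thank you.
[Details] Simulating replica loss to test the cluster's automatic replica balancing.
[Background] The data directory on node A was deleted directly.
[Business impact] Replicas are not recovered automatically. SHOW TABLET reports every replica as healthy, yet queries fail with a file not found error.
[Shared-data (storage-compute separation)] No
[StarRocks version] e.g. 3.1.4
[Cluster size] e.g. 3fe+12be(
[Machine info] CPU virtual cores / memory / NIC, e.g. 32C/256G/10GbE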

BE log on node B:
I0409 14:43:48.710940 3453521 compaction_task.cpp:135] compaction finish. status:Internal error: reader get_next error: Not found: /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory
/data/fuxi_ci_workspace/71824250/be/src/storage/vertical_compaction_task.cpp:38 _vertical_compaction_data(&statistics), task info:[CompactionTaskInfo] task_id:4064, tablet_id:1314182, compaction score:7.06857, algorithm:VERTICAL_COMPACTION, state:COMPACTION_FAILED, compaction_type:base, output_version:[0-2217], start_time:2024-04-09 14:43:48.710, end_time:2024-04-09 14:43:48.710, elapsed_time:201 us, input_rowsets_size:155100903, input_segments_num:3, input_rowsets_num:3, input_rows_num:14307921, output_num_rows:0, merged_rows:0, filtered_rows:0, output_segments_num:0, output_rowset_size:0, column_group_size:3, total_output_num_rows:0, total_merged_rows:0, total_del_filtered_rows:0, is_shortcut_compaction:0, is_manual_compaction:0, progress:0
I0409 14:43:49.710538 3453519 size_tiered_compaction_policy.cpp:331] pick tablet 1314182 for size-tiered compaction rowset version=0-2217 score=7.06857 level_size=149959577 total_size=155100903 segment_num=3 force_base_compaction=1 reached_max_versions=0
I0409 14:44:03.473343 2075430 snapshot_manager.cpp:130] make full snapshot tablet:1314182 cur_version:2217 req_version:0 timeout:180
W0409 14:44:03.473466 2075430 rowset.cpp:350] Fail to link /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat to /srv/BigData/data1/starrocks/be/storage/snapshot/20240409144403.211300.180/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory [2]
W0409 14:44:03.473554 2075430 agent_server.cpp:514] fail to make_snapshot. tablet_id:1314182 msg:Runtime error: Fail to link segment data file
I0409 14:44:48.432811 2935803 snapshot_manager.cpp:130] make full snapshot tablet:1314182 cur_version:2217 req_version:0 timeout:180
W0409 14:44:48.432987 2935803 rowset.cpp:350] Fail to link /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat to /srv/BigData/data1/starrocks/be/storage/snapshot/20240409144448.211508.180/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory [2]
W0409 14:44:48.433077 2935803 agent_server.cpp:514] fail to make_snapshot. tablet_id:1314182 msg:Runtime error: Fail to link segment data file
I0409 14:45:06.759310 2935803 snapshot_manager.cpp:130] make full snapshot tablet:1314182 cur_version:2217 req_version:0 timeout:180
W0409 14:45:06.759429 2935803 rowset.cpp:350] Fail to link /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat to /srv/BigData/data1/starrocks/be/storage/snapshot/20240409144506.211580.180/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory [2]
W0409 14:45:06.759543 2935803 agent_server.cpp:514] fail to make_snapshot. tablet_id:1314182 msg:Runtime error: Fail to link segment data file
I0409 14:45:27.026134 2935803 snapshot_manager.cpp:130] make full snapshot tablet:1314182 cur_version:2217 req_version:0 timeout:180
W0409 14:45:27.026248 2935803 rowset.cpp:350] Fail to link /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat to /srv/BigData/data1/starrocks/be/storage/snapshot/20240409144527.211667.180/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory [2]
W0409 14:45:27.026358 2935803 agent_server.cpp:514] fail to make_snapshot. tablet_id:1314182 msg:Runtime error: Fail to link segment data file
I0409 14:45:48.717973 3453522 compaction_manager.cpp:87] submit task to compaction pool, task_id:4067, tablet_id:1314182, compaction_type:base, compaction_score:7.06857 for round:79169, task_queue_size:1
I0409 14:45:48.718024 3453521 compaction_task.cpp:39] start compaction. task_id:4067, tablet:1314182, algorithm:VERTICAL_COMPACTION, compaction_type:base, compaction_score:7.06857, output_version:[0-2217], input rowsets size:3
W0409 14:45:48.718164 3453521 vertical_compaction_task.cpp:214] reader get next error. tablet=1314182, err=Not found: /srv/BigData/data1/starrocks/be/storage/data/43/1314182/704877449/02000000000a3ae0eb4f5a80ea04e62eda549eab116f14b3_0.dat: No such file or directory
W0409 14:45:48.718192 3453521 compaction_task.cpp:182] compaction task:4067, tablet:1314182 failed.

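I ran SHOW PROC '/cluster_balance/history_tablets'; to check the tablet balance records.
It shows ERRMsg : make snapshot failed. backend_ip: node B.

In the end I had to run this manually: ADMIN SET REPLICA STATUS PROPERTIES('tablet_id' = '1315169', 'backend_id' = '10007', 'status' = 'bad');
Only after the replica is marked bad can the table be queried normally, and only until a query hits the next missing path. Then I have to check the corresponding log again, find the backend IP, and manually mark that tablet's replica as bad.
The cluster has many tablets in this state, and they do not recover automatically.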
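
Until the automatic repair works, the manual loop described above (read the BE log, find the tablet, mark its replica bad) could be scripted. The following is only a minimal sketch, not an official StarRocks tool. It assumes the two warning messages keep exactly the wording seen in the log above ("fail to make_snapshot. tablet_id:..." and "reader get next error. tablet=..."), and that you look up the backend_id of the damaged BE yourself, for example with SHOW BACKENDS. The function names find_damaged_tablets and build_set_bad_statements are hypothetical helpers made up for this example.

```python
import re

# Patterns taken from the two BE warnings quoted in this post. Assumption:
# other StarRocks versions may word these messages differently.
MISSING_FILE_PATTERNS = [
    re.compile(r"fail to make_snapshot\. tablet_id:(\d+) msg:.*Fail to link segment data file"),
    re.compile(r"reader get next error\. tablet=(\d+), err=Not found:"),
]


def find_damaged_tablets(log_lines):
    """Return the sorted, de-duplicated tablet IDs whose files are reported missing."""
    tablets = set()
    for line in log_lines:
        for pattern in MISSING_FILE_PATTERNS:
            match = pattern.search(line)
            if match:
                tablets.add(int(match.group(1)))
    return sorted(tablets)


def build_set_bad_statements(tablet_ids, backend_id):
    """Build one ADMIN SET REPLICA STATUS statement per tablet for the given BE.

    backend_id is supplied by the caller; it is NOT derived from the log.
    """
    return [
        "ADMIN SET REPLICA STATUS PROPERTIES("
        f"'tablet_id' = '{tablet_id}', 'backend_id' = '{backend_id}', 'status' = 'bad');"
        for tablet_id in tablet_ids
    ]


# Small demo using two warnings copied from the log above.
sample_log = [
    "W0409 14:44:03.473554 2075430 agent_server.cpp:514] fail to make_snapshot. "
    "tablet_id:1314182 msg:Runtime error: Fail to link segment data file",
    "W0409 14:45:48.718192 3453521 compaction_task.cpp:182] compaction task:4067, "
    "tablet:1314182 failed.",
]
for statement in build_set_bad_statements(find_damaged_tablets(sample_log), 10007):
    print(statement)
```

The script only prints the statements so they can be reviewed before being run through a MySQL client against the FE. Marking a replica bad is only safe when the tablet still has at least one healthy replica on another BE, so it is worth checking that first.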