StarRocks 2.5: replica not repaired automatically after simulated disk corruption

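【Details】On StarRocks 2.5, after simulating disk corruption, the damaged replica is not repaired automatically. It is only repaired after manually running ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "", "backend_id" = "", "status" = "bad").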
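The manual workaround quoted above needs a tablet ID and a backend ID filled in. A minimal sketch of formatting that statement follows. The helper name `build_set_replica_bad_sql` and the IDs `12345` / `10001` are hypothetical placeholders, not values confirmed in this thread.

```python
def build_set_replica_bad_sql(tablet_id: int, backend_id: int) -> str:
    """Format the ADMIN SET REPLICA STATUS statement quoted in the post.

    The statement marks one replica of a tablet as bad. This helper only
    builds the SQL text; it does not connect to or modify any cluster.
    """
    if tablet_id <= 0 or backend_id <= 0:
        raise ValueError("tablet_id and backend_id must be positive integers")
    return (
        "ADMIN SET REPLICA STATUS PROPERTIES("
        f'"tablet_id" = "{tablet_id}", '
        f'"backend_id" = "{backend_id}", '
        '"status" = "bad");'
    )


# Hypothetical IDs, for illustration only.
print(build_set_replica_bad_sql(12345, 10001))
```

Running the printed statement from a MySQL client connected to the FE is left out on purpose, since it changes replica state.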
【Background】
1. Overwrote a BE data file so that its content is empty.
2. Any query that hits that replica fails with an error.

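【Business impact】Some queries fail
【Shared-data (storage-compute separated)?】
【StarRocks version】e.g. 2.5.10
【Cluster size】e.g. 3 FE (3 followers) + 3 BE (FE and BE co-located)
【Machine info】
【Contact】Community group 7-n1
【Attachments】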

  • BE log for the failing query: W1226 11:47:53.917090 95250 rowset.cpp:141] Fail to open /mnt/data2/be/storage/data/9/39853/913390453/020000000000070e154b95ef6731ce922253ca3003f77d9f_0.dat: Corruption: Bad segment file /mnt/data2/doris/be/storage/data/9/39853/913390453/020000000000070e154b95ef6731ce922253ca3003f77d9f_0.dat: file size 1 < 12
  • tabletchecker:
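
The tablet ID used in the later log search can be read off the segment path in the BE log line above. A minimal sketch follows, assuming the path layout is `.../data/<shard>/<tablet_id>/<schema_hash>/<rowset_id>_<segment>.dat`. That layout is inferred from this one path, and the name `parse_segment_path` is hypothetical.

```python
import re

# Assumed layout: .../data/<shard>/<tablet_id>/<schema_hash>/<rowset_id>_<segment>.dat
SEGMENT_PATH = re.compile(
    r"/data/(?P<shard>\d+)/(?P<tablet_id>\d+)/(?P<schema_hash>\d+)/"
    r"(?P<rowset_id>[0-9a-f]+)_(?P<segment>\d+)\.dat"
)


def parse_segment_path(path: str):
    """Return the numeric parts of a segment file path, or None if it does not match."""
    match = SEGMENT_PATH.search(path)
    if match is None:
        return None
    return {
        "shard": int(match.group("shard")),
        "tablet_id": int(match.group("tablet_id")),
        "schema_hash": int(match.group("schema_hash")),
        "rowset_id": match.group("rowset_id"),
        "segment": int(match.group("segment")),
    }


# Path copied from the BE warning in this post.
bad_file = (
    "/mnt/data2/be/storage/data/9/39853/913390453/"
    "020000000000070e154b95ef6731ce922253ca3003f77d9f_0.dat"
)
print(parse_segment_path(bad_file)["tablet_id"])
```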

Help! Any guidance from the experts would be appreciated.

Search the leader FE log for entries about this tablet.

There is no other information.
cat /mnt/data1/doris/log/fe/fe.log|grep 39853
2024-01-08 00:23:40,753 INFO (tablet scheduler|39) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 1.5398537745410739, low/mid/high: 2/0/1
2024-01-08 00:24:00,760 INFO (tablet scheduler|39) [ClusterLoadStatistic.classifyBackendByLoad():149] classify backend by load. medium: HDD, avg load score: 1.5398537745410739, low/mid/high: 2/0/1
be id: 10065, is available: true, mediums: [{medium: HDD, replica: 1089, used: 12795753938, total: 298.8GB, score: 2.84299080956585},{medium: SSD, replica: 0, used: 0, total: 0B, score: NaN},], paths: [{path: /mnt/data1/doris/be/storage, path hash: -3355659092685903090, be: 10065, medium: HDD, used: 2137448016, total: 79153491536},{path: /mnt/data0/doris/be/storage, path hash: 999321683415977915, be: 10065, medium: HDD, used: 2296803529, total: 79312847049},{path: /mnt/data2/doris/be/storage, path hash: 983070084747692929, be: 10065, medium: HDD, used: 4097996333, total: 81114039853},{path: /mnt/data3/doris/be/storage, path hash: 4624017804433583482, be: 10065, medium: HDD, used: 4263506060, total: 81279549580},]
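
The lines returned by the search above contain `39853` only as part of longer numbers (`1.5398537745410739` and `81114039853`), so they do not appear to refer to the tablet at all. A minimal sketch of matching the ID only when it stands alone as a number follows. The helper name `lines_mentioning_id` is hypothetical.

```python
import re


def lines_mentioning_id(lines, tablet_id: int):
    """Keep only lines where tablet_id is not embedded in a longer digit run."""
    pattern = re.compile(rf"(?<!\d){tablet_id}(?!\d)")
    return [line for line in lines if pattern.search(line)]


sample = [
    # Fragments of the lines pasted in this post: substring matches only.
    "classify backend by load. medium: HDD, avg load score: 1.5398537745410739",
    "medium: HDD, used: 4097996333, total: 81114039853",
    # Hypothetical line, showing what a real mention would look like.
    "tablet 39853 has an unhealthy replica",
]
print(lines_mentioning_id(sample, 39853))
```

With GNU grep, `grep -w 39853` gives a similar effect for most log lines, because `-w` only matches whole words.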

2024-01-08 11:59:58,549 INFO (thrift-server-pool-296317|309887) [QeProcessorImpl.reportExecStatus():130] ReportExecStatus() failed, query does not exist, fragment_instance_id=63e26137-adda-11ee-a758-0242f4776d02, query_id=63e26137-adda-11ee-a758-0242f4776cfe,
2024-01-08 11:59:58,550 INFO (thrift-server-pool-296318|309888) [QeProcessorImpl.reportExecStatus():130] ReportExecStatus() failed, query does not exist, fragment_instance_id=63e26137-adda-11ee-a758-0242f4776cff, query_id=63e26137-adda-11ee-a758-0242f4776cfe,
2024-01-08 12:00:04,562 INFO (ReportHandler|146) [ReportHandler.tabletReport():337] backend[10063] reports 1090 tablet(s). report version: 17035835170157
2024-01-08 12:00:04,564 INFO (ReportHandler|146) [TabletInvertedIndex.tabletReport():288] finished to do tablet diff with backend[10063]. sync: 0. metaDel: 0. foundValid: 1090. foundInvalid: 0. migration: 0. found invalid transactions 0. found republish transactions 0 cost: 1 ms
2024-01-08 12:00:05,052 INFO (colocate group clone checker|94) [ColocateTableBalancer.matchGroups():878] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2024-01-08 12:00:05,201 INFO (tablet checker|40) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2024-01-08 12:00:05,203 INFO (tablet checker|40) [TabletChecker.doCheck():411] finished to check tablets. isUrgent: false, unhealthy/total/added/in_sched/not_ready: 0/1103/0/0/0, cost: 1 ms, in lock time: 1 ms, wait time: 0ms
2024-01-08 12:00:05,203 INFO (tablet checker|40) [TabletChecker.runAfterCatalogReady():202] TStat :
TStat num of tablet check round: 365259 (+1)
TStat cost of tablet check(ms): 909720 (+1)
TStat num of tablet checked in tablet checker: 1917516142 (+1103)
TStat num of unhealthy tablet checked in tablet checker: 1050836 (+0)
TStat num of tablet being added to tablet scheduler: 1087 (+0)
TStat num of tablet schedule round: 7302909 (+20)
TStat cost of tablet schedule(ms): 99913 (+0)
TStat num of tablet being scheduled: 107588 (+0)
TStat num of tablet being scheduled succeeded: 2142 (+0)
TStat num of tablet being scheduled failed: 102466 (+0)
TStat num of tablet being scheduled discard: 0 (+0)
TStat num of tablet priority upgraded: 0 (+0)
TStat num of clone task: 2119 (+0)
TStat num of clone task succeeded: 2119 (+0)
TStat num of clone task failed: 0 (+0)
TStat num of clone task timeout: 0 (+0)
TStat num of replica missing error: 70867 (+0)
TStat num of replica version missing error: 0 (+0)
TStat num of replica unavailable error: 0 (+0)
TStat num of replica redundant error: 43 (+0)
TStat num of replica missing in cluster error: 0 (+0)
TStat num of balance scheduled: 36678 (+0)
TStat num of colocate replica mismatch: 0 (+0)
TStat num of colocate replica redundant: 0 (+0)
TStat num of colocate balancer running round: 0 (+0)
Query error:
@jingdan

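How did you simulate the disk corruption? My cluster was hit by real disk corruption: the BE node hung and could not exit normally, and I would like to reproduce that as well.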

  1. Delete the file directly.
  2. Use echo to empty the file's contents.
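
The second method can be tried safely on a scratch file. A minimal sketch follows, assuming the command was a plain `echo > file`, which leaves a single newline byte. That would be consistent with the `file size 1 < 12` message in the BE log, though the exact command is not given in the thread. The 12-byte minimum is taken from that log message, and the helper name is hypothetical.

```python
import os
import tempfile

# Taken from "file size 1 < 12" in the BE log; treated here as an assumption.
MIN_SEGMENT_SIZE = 12


def simulate_echo_truncate(path: str) -> int:
    """Mimic `echo > path`: replace the file contents with one newline byte."""
    with open(path, "wb") as f:
        f.write(b"\n")
    return os.path.getsize(path)


# Work on a throwaway file, never on real BE storage.
with tempfile.TemporaryDirectory() as scratch:
    segment = os.path.join(scratch, "example_0.dat")
    with open(segment, "wb") as f:
        f.write(b"x" * 64)  # stand-in for real segment data
    size = simulate_echo_truncate(segment)
    print(size, size < MIN_SEGMENT_SIZE)  # prints: 1 True
```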

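Could you open a separate thread and describe your problem in detail: cluster version, number of FEs and BEs, number of disks per BE, and whether a daemon (process supervisor) is configured?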

Reposted as: StarRocks 2.5.10: replica cannot be repaired automatically after simulated disk corruption

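I tried deleting the storage directory and found that StarRocks recreated it automatically. I then made the directory containing storage inaccessible, but the BE still did not go down; it simply could no longer insert into or truncate tables. I was not able to reproduce the case where a BE node cannot exit.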