Tablets stuck in unhealthy state and never recover

【Details】
The statistics.column_statistics table has 5 unhealthy tablets, all of them on the same BE.

I ran this statement: ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "12726", "backend_id" = "11005", "status" = "bad");

The tablet replica was not repaired automatically. Instead, the BE kept logging these errors:
W0614 02:59:43.523180 78682 tablet_manager.cpp:960] Fail to remove or move /data3/starrocks/storage/data/4/12726 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:12726 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:12726 #version:2 [4152501 4152501@0 4152501.1] pending: rowsets:3
W0614 02:59:43.523231 78682 tablet_manager.cpp:998] before adding new cloned tablet, delete stale TABLET_SHUTDOWN tablet failed after 0 times retry, tablet:12726 st:Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:12726 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:12726 #version:2 [4152501 4152501@0 4152501.1] pending: rowsets:3
W0614 02:59:43.523257 78682 agent_task.cpp:321] clone failed. signature: 12726

Later I truncated the table: TRUNCATE TABLE statistics.column_statistics; Things were normal for a while,
then the errors came back on the same table, this time on a different BE.

So I tried to repair it again:

SHOW TABLET 5783355;

SHOW PROC '/dbs/10002/12716/partitions/5783350/12717/5783355';

ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "5783355", "backend_id" = "11001", "status" = "bad");

It still had no effect. The BE log keeps reporting clone failed over and over, and the tablet is never repaired:

W0614 03:48:26.566821 56042 tablet_manager.cpp:910] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.579227 56010 tablet_manager.cpp:960] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.581297 56010 tablet_manager.cpp:998] before adding new cloned tablet, delete stale TABLET_SHUTDOWN tablet failed after 0 times retry, tablet:5783355 st:Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.581725 56010 agent_task.cpp:321] clone failed. signature: 5783355
W0614 03:51:27.044621 56042 tablet_manager.cpp:910] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, table
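
To see which tablets keep failing to clone, and how often, the "clone failed. signature: <tablet_id>" lines shown above can be tallied per tablet. This is only a rough sketch: the function name is made up for illustration, and the log path in the usage comment is an assumption about where the BE warning log lives on your machine.

```shell
#!/usr/bin/env bash
# Count "clone failed" warnings per tablet id in a BE log file.
# Prints one "<tablet_id> <count>" pair per line, sorted by tablet id text.
clone_failures_by_tablet() {
    grep -o 'clone failed\. signature: [0-9]*' "$1" \
        | awk '{print $NF}' \
        | sort \
        | uniq -c \
        | awk '{print $2 " " $1}'
}

# Usage sketch (hypothetical path, adjust to your deployment):
# clone_failures_by_tablet /path/to/be/log/be.WARNING
```

A tablet whose count keeps growing between runs is one that the scheduler keeps retrying without success.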

All BEs show a normal status and their heartbeats keep updating.

Later this table got 5 more unhealthy tablets, but those recovered after a while. The original 5 keep reporting errors.

Other tables in the cluster run into this as well, but they recover successfully; only the statistics table does not.

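【Background】
The BEs have been restarted. During an earlier restart they were started under a different user and reported Permission denied; the permissions were granted afterwards.
A few disks were also added.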

The values of ulimit -u and ulimit -n are both fairly large, around 130,000.

【Business impact】

【StarRocks version】
2.5.3

【Cluster size】e.g. 3 FE (3 followers) + 3 BE (FE and BE co-located)
【Machine info】
【Contact】

【Attachments】

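StarRocks repairs unhealthy replicas like these on its own; there is no need to set them to bad manually. Please check whether the be.INFO log contains permission denied. The user that starts the process must be the same as the deployment user; use chmod -R (or chown -R) to fix the files left behind by the process that was started under the wrong user.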

I checked: neither the INFO nor the WARN log contains anything related to permission denied, and the permissions had already been granted earlier.

In the end, on the advice of an expert in the community, restarting the BE solved it.

The thread limit was configured fairly low (4096). Once the thread limit is reached, apply cannot recover, so the ulimit -u thread limit needs to be raised.
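
A running process keeps the resource limits it was started with, so the value printed by ulimit -u in an interactive shell is not necessarily the one the BE is running under. That may be why a shell can show a large value while the process is actually capped much lower. On Linux the limit in effect can be read from /proc/<pid>/limits, and because the "Max processes" limit counts all threads owned by the user, it is worth comparing it against that user's total thread count. The sketch below assumes the BE process is named starrocks_be and runs as a user called starrocks; both names, and the function names, are assumptions to adjust.

```shell
#!/usr/bin/env bash
# Soft "Max processes" limit recorded in a /proc/<pid>/limits file.
# This is the value that thread creation is checked against.
soft_nproc_limit() {
    awk '/^Max processes/ {print $3}' "$1"
}

# Number of threads currently owned by a user (one LWP id per line).
user_thread_count() {
    ps -L -U "$1" -o lwp= | wc -l
}

# Usage sketch; the process name and user name are assumptions:
# pid=$(pgrep -o starrocks_be)
# soft_nproc_limit "/proc/$pid/limits"
# user_thread_count starrocks
```

If the two numbers are close, "Could not create thread: Resource temporarily unavailable" is the expected symptom. After raising the limit, the BE has to be restarted from an environment where the new value applies, since an already running process does not pick it up.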