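【Details】
The table statistics.column_statistics has 5 unhealthy tablets:
All of them are on the same BE.
After I ran this statement: ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "12726", "backend_id" = "11005", "status" = "bad");
the tablet replica was not repaired automatically. Instead, the BE kept logging errors: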
W0614 02:59:43.523180 78682 tablet_manager.cpp:960] Fail to remove or move /data3/starrocks/storage/data/4/12726 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:12726 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:12726 #version:2 [4152501 4152501@0 4152501.1] pending: rowsets:3
W0614 02:59:43.523231 78682 tablet_manager.cpp:998] before adding new cloned tablet, delete stale TABLET_SHUTDOWN tablet failed after 0 times retry, tablet:12726 st:Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:12726 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:12726 #version:2 [4152501 4152501@0 4152501.1] pending: rowsets:3
W0614 02:59:43.523257 78682 agent_task.cpp:321] clone failed. signature: 12726
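Later I truncated the table: TRUNCATE table statistics.column_statistics; and things were normal for a while.
Then the error came back on the same table, but this time on a different BE.
I tried the same repair again:
show tablet 5783355
SHOW PROC '/dbs/10002/12716/partitions/5783350/12717/5783355';
ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "5783355", "backend_id" = "11001", "status" = "bad");
It still had no effect. The BE log keeps reporting "clone failed" over and over, and the tablet never gets repaired: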
W0614 03:48:26.566821 56042 tablet_manager.cpp:910] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.579227 56010 tablet_manager.cpp:960] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.581297 56010 tablet_manager.cpp:998] before adding new cloned tablet, delete stale TABLET_SHUTDOWN tablet failed after 0 times retry, tablet:5783355 st:Internal error: get_applied_rowsets failed, tablet updates is in error state: tablet:5783355 submit apply task failed: Runtime error: Could not create thread: Resource temporarily unavailable tablet:5783355 #version:1237 [1 1234@1235 1235] pending: rowsets:554
W0614 03:50:10.581725 56010 agent_task.cpp:321] clone failed. signature: 5783355
W0614 03:51:27.044621 56042 tablet_manager.cpp:910] Fail to remove or move /data4/starrocks/storage/data/8/5783355 :Internal error: get_applied_rowsets failed, table
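To show how often each tablet hits these two messages, here is a minimal Python sketch that counts them per tablet id. It assumes the BE warning log is read as unwrapped lines (the copies pasted above were wrapped by the terminal). The function name `summarize` is made up for illustration and is not part of StarRocks.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted above.
CLONE_FAILED = re.compile(r"clone failed\. signature: (\d+)")
THREAD_ERR = re.compile(
    r"tablet:(\d+) submit apply task failed: "
    r"Runtime error: Could not create thread"
)


def summarize(lines):
    """Count clone failures and thread-creation errors per tablet id."""
    clone_failures = Counter()
    thread_errors = Counter()
    for line in lines:
        m = CLONE_FAILED.search(line)
        if m:
            clone_failures[int(m.group(1))] += 1
        m = THREAD_ERR.search(line)
        if m:
            thread_errors[int(m.group(1))] += 1
    return clone_failures, thread_errors
```

One way to use it is to open the BE warning log and pass the file object to `summarize`. The log file name and location depend on the deployment, so I have not hard-coded them.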
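All BEs show a normal status and their heartbeats keep updating.
Later this table got 5 more unhealthy tablets, but those recovered after a while. The original 5 are still reporting errors.
Other tables in the cluster also run into this, but they recover successfully; only the statistics table does not.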
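【Background】
The BEs have been restarted. During an earlier restart I switched to a different OS user and got "Permission denied"; I granted the missing permissions afterwards.
A few disks were also added.
ulimit -u and ulimit -n are both fairly large, around 130,000.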
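The ulimit values above come from my shell, which may not match what the running BE process actually has, especially since the BE user was changed. Below is a minimal Python sketch, assuming Linux, that reads the limits the kernel applies to a given process from /proc, plus its current thread count and two system-wide ceilings. The names `parse_limits` and `report`, and the pid passed in, are made up for illustration. I have not confirmed which limit, if any, is behind "Could not create thread: Resource temporarily unavailable".

```python
import os
import re


def parse_limits(text):
    """Parse the content of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip the header row and blank lines
        limits[fields[0]] = (fields[1], fields[2])
    return limits


def report(pid):
    """Print thread usage and thread-related limits for one process."""
    with open(f"/proc/{pid}/limits") as f:
        limits = parse_limits(f.read())
    # Each entry under /proc/<pid>/task is one thread of the process.
    print("threads in use:", len(os.listdir(f"/proc/{pid}/task")))
    for name in ("Max processes", "Max open files"):
        print(name, limits.get(name))
    for path in ("/proc/sys/kernel/threads-max", "/proc/sys/kernel/pid_max"):
        with open(path) as f:
            print(path, f.read().strip())
```

Calling `report(<BE pid>)` on each BE host would show whether "Max processes" for the running process really is around 130,000 and how close the thread count is to it. If the BE runs under systemd or inside a container, a cgroup task limit could also apply; this sketch does not check that.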
【Business impact】
【StarRocks version】
2.5.3
【Cluster size】e.g. 3 FE (3 followers) + 3 BE (FE and BE co-located)
【Machine info】
【Contact】
【Attachments】