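StarRocks 2.1.6: tables cannot be written to while a node is being decommissioned

The Scheduling Tablets metric keeps backing up
[Details] Tables cannot be written to while a node is being decommissioned
[Background] Decommissioning a BE node
[Business impact]
[StarRocks version]: 2.1.6
[Cluster size] e.g. 3 FE (3 followers) + 9 BE
[Machine info] CPU virtual cores / memory / NIC: 16C / 64G / BE data disk: 500 GB SSD
[Contact] Community group 16, JoJo
[Attachments]
Monitoring dashboard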

  • fe.warn.log
    fe.warn.log (49.2 MB)
  • be.warn.log
    be.WARNING.log (11.4 MB)
  • Flink error
    {"Status":"Fail","BeginTxnTimeMs":55,"Message":"intolerable failure in opening node channels","NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,"Label":"bcbcb31c-629c-4666-ba40-6e1a122a9187","LoadBytes":1121894,"StreamLoadPutTimeMs":64,"NumberTotalRows":0,"WriteDataTimeMs":27,"TxnId":190408441,"LoadTimeMs":148,"ReadDataTimeMs":2,"NumberLoadedRows":0,"NumberFilteredRows":0}
    {}
  • Other information
    While the node was being decommissioned, the other BE nodes had less than 10% free space left on their data disks.
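
For anyone triaging the same Flink failure, here is a minimal sketch of pulling the useful fields out of the response body quoted above. The function name `summarize_stream_load` is made up for illustration, and the sketch assumes a successful load reports `"Status": "Success"`; the JSON string is copied from the error in this post.

```python
import json

# Response body copied from the Flink error quoted in this post.
RESPONSE = (
    '{"Status":"Fail","BeginTxnTimeMs":55,'
    '"Message":"intolerable failure in opening node channels",'
    '"NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,'
    '"Label":"bcbcb31c-629c-4666-ba40-6e1a122a9187","LoadBytes":1121894,'
    '"StreamLoadPutTimeMs":64,"NumberTotalRows":0,"WriteDataTimeMs":27,'
    '"TxnId":190408441,"LoadTimeMs":148,"ReadDataTimeMs":2,'
    '"NumberLoadedRows":0,"NumberFilteredRows":0}'
)


def summarize_stream_load(body: str) -> dict:
    """Extract the fields that matter when triaging a failed load."""
    result = json.loads(body)
    return {
        # Assumption: a successful load reports Status == "Success".
        "ok": result.get("Status") == "Success",
        "label": result.get("Label"),
        "txn_id": result.get("TxnId"),
        "message": result.get("Message"),
        "loaded_rows": result.get("NumberLoadedRows", 0),
    }


summary = summarize_stream_load(RESPONSE)
if not summary["ok"]:
    print(f"load {summary['label']} (txn {summary['txn_id']}) failed: {summary['message']}")
```

Logging the label and transaction ID together with the message makes it easier to match a failed load against fe.warn.log and be.WARNING.log.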

How was the BE node taken offline? DROP or DECOMMISSION?

DECOMMISSION

Please run SHOW PROC '/statistic'; and check the result. Has the max_scheduling_tablets parameter been changed? What is the scheduling tablets metric at now, and is it going down?

Please also check the result of SHOW PROC '/cluster_balance';

Everything is at the default value; nothing has been changed.

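Scheduling tablets keeps growing. This is a production environment and the write jobs have not been stopped; when a write fails, we just recreate the table.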

SHOW PROC '/cluster_balance/history_tablets';
Run this command several times and check whether the most recently scheduled tablets are the same few tablets over and over.
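
One way to do that comparison, as a rough sketch: collect the tablet IDs from each run and count in how many runs each one shows up. The helper `repeated_tablets` and the sample runs below are hypothetical; how the IDs are extracted from the command output is left out.

```python
from collections import Counter


def repeated_tablets(snapshots, min_runs=2):
    """Return tablet IDs seen in at least `min_runs` snapshots, sorted."""
    seen = Counter()
    for snapshot in snapshots:
        # Count each tablet at most once per run, even if it is listed twice.
        seen.update(set(snapshot))
    return sorted(tablet for tablet, runs in seen.items() if runs >= min_runs)


# Hypothetical tablet IDs collected from three consecutive runs.
runs = [
    [100890, 100894, 17453164],
    [100890, 100894, 17454904],
    [100890, 17455396],
]
stuck = repeated_tablets(runs, min_runs=3)
print(stuck)
```

Tablets that show up in every run are the ones the scheduler keeps retrying, so they are the first ones to inspect.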

history_tablets.log (389.2 KB)
This is the output after deduplication. Is there something wrong with these states?
13 100890 REPAIR CANCELLED
13 100894 REPAIR CANCELLED
13 100898 REPAIR CANCELLED
14 1490396 REPAIR CANCELLED
13 170088 REPAIR CANCELLED
14 170092 REPAIR CANCELLED
14 170096 REPAIR CANCELLED
14 170100 REPAIR CANCELLED
14 170104 REPAIR CANCELLED
14 170108 REPAIR CANCELLED
2 17447628 REPAIR CANCELLED
2 17447640 REPAIR CANCELLED
1 17453164 REPAIR FINISHED
1 17454904 REPAIR FINISHED
1 17454916 REPAIR FINISHED
1 17454924 REPAIR FINISHED
1 17455396 REPAIR FINISHED
1 17455404 REPAIR FINISHED
1 17455416 REPAIR FINISHED
1 17455428 REPAIR FINISHED
1 17475384 REPAIR FINISHED
1 17477560 REPAIR FINISHED
1 17477572 REPAIR FINISHED
1 17485508 REPAIR FINISHED
14 2829304 REPAIR CANCELLED
14 4477104 REPAIR CANCELLED
13 94314 REPAIR CANCELLED
13 94326 REPAIR CANCELLED
13 94378 REPAIR CANCELLED
13 94390 REPAIR CANCELLED
13 94402 REPAIR CANCELLED
14 95154 REPAIR CANCELLED
14 95166 REPAIR CANCELLED
14 95178 REPAIR CANCELLED
14 95190 REPAIR CANCELLED
14 95202 REPAIR CANCELLED
14 95214 REPAIR CANCELLED
13 95226 REPAIR CANCELLED
14 95238 REPAIR CANCELLED
14 95250 REPAIR CANCELLED
14 95262 REPAIR CANCELLED
14 95274 REPAIR CANCELLED
14 95286 REPAIR CANCELLED
14 95298 REPAIR CANCELLED
14 95310 REPAIR CANCELLED
14 95322 REPAIR CANCELLED
14 95334 REPAIR CANCELLED
14 95346 REPAIR CANCELLED
14 95358 REPAIR CANCELLED
14 95418 REPAIR CANCELLED
14 95430 REPAIR CANCELLED
14 95442 REPAIR CANCELLED
14 95454 REPAIR CANCELLED
14 95466 REPAIR CANCELLED
14 95478 REPAIR CANCELLED
14 96138 REPAIR CANCELLED
14 96150 REPAIR CANCELLED
14 96162 REPAIR CANCELLED
14 96498 REPAIR CANCELLED
14 96510 REPAIR CANCELLED
14 96522 REPAIR CANCELLED
14 96654 REPAIR CANCELLED
14 96666 REPAIR CANCELLED
14 96678 REPAIR CANCELLED
14 97170 REPAIR CANCELLED
14 97182 REPAIR CANCELLED
14 97194 REPAIR CANCELLED
14 97206 REPAIR CANCELLED
14 97254 REPAIR CANCELLED
14 97266 REPAIR CANCELLED
14 97278 REPAIR CANCELLED
14 97374 REPAIR CANCELLED
14 97386 REPAIR CANCELLED
14 97398 REPAIR CANCELLED
14 97770 REPAIR CANCELLED
14 97782 REPAIR CANCELLED
14 97794 REPAIR CANCELLED
14 97806 REPAIR CANCELLED
14 97854 REPAIR CANCELLED
14 97866 REPAIR CANCELLED
14 97878 REPAIR CANCELLED
14 97974 REPAIR CANCELLED
14 97986 REPAIR CANCELLED
14 97998 REPAIR CANCELLED
14 98658 REPAIR UNEXPECTED
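
A listing this long is easier to read grouped by state. Below is a minimal sketch that does the grouping, assuming each line holds four whitespace-separated columns: occurrence count, tablet ID, task type, and state. That column reading is an interpretation of the listing above, not something documented, and `group_by_state` is a made-up helper name.

```python
from collections import defaultdict


def group_by_state(lines):
    """Map each state to the tablet IDs reported with it."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        # Assumed column order: count, tablet ID, task type, state.
        _count, tablet_id, _task_type, state = parts
        groups[state].append(int(tablet_id))
    return dict(groups)


# Three rows taken from the listing above.
SAMPLE = """\
13 100890 REPAIR CANCELLED
1 17453164 REPAIR FINISHED
14 98658 REPAIR UNEXPECTED
"""

groups = group_by_state(SAMPLE.splitlines())
for state, tablets in groups.items():
    print(state, len(tablets), tablets)
```

Feeding in the full listing instead of `SAMPLE` gives the number of distinct tablets per state, which separates the tablets that finished repair from those that keep being cancelled.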

98658 REPAIR UNEXPECTED: is this one a problem?

Was this run on the FE leader node? Please also check the status of the cancelled tablets.

Yes, it was run on the leader node.

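Are these 2-replica tables? If only one replica is left and it is in a bad state that needs repair, it can no longer be repaired. You need to drop and recreate the tables those replicas belong to first, then restart the FE leader node to recover.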

How can this be handled quickly?