Large numbers of SELECT and INSERT statements hang and time out

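【Symptoms】All SQL tasks hang. Even the simplest statement, insert select xxxx limit 1;, takes five minutes to return successfully.
【Business impact】Resolved for now
【StarRocks version】2.5.5
【Cluster size】3 FE + 15 BE
【Contact】Community group 3 - so this has been left pending
【Details】At around 11 a.m., operations staff accidentally powered off the BE-07 node directly. After the machine rebooted, the BE service was brought back up automatically by supervisor.
About ten to twenty minutes later we started receiving a large number of task errors, all of which turned out to be execution timeouts.
【Solutions tried】
1. Rolling restart of all BE nodes: did not help.
2. Rolling restart of all FE nodes: did not help. Tasks kept timing out.
3. SHOW PROC '/statistic'; showed no unhealthy replicas. Also ran SHOW PROC '/transactions';
4. Searched the FE and BE logs for lock-related entries: none found.
5. Rebooted the BE-07 machine: did not help.
6. Stopped the BE service on BE-07: after a few minutes, SQL tasks ran normally again.
7. Dropped BE-07 from the cluster; the plan is to reinstall it and add it back.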

【Attachments】
be.warning contains a large number of log entries like the following:

Do you still have the BE logs from the period right after the restart?

Adding the symptoms at the time and the BE logs:

be.log.zip (27.6 MB)

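Additional information: the same situation has just reproduced. A large SQL statement running in the cluster drove the load on one node so high that it restarted. After the restart, a large number of log entries like the ones below appeared and SQL ran slowly.
Current workaround: stop the problematic BE node, after which SQL runs normally.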


Backend status and tablet status both look normal:


Logs:

60089/220688835/02000000dc632961554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
W0711 04:44:01.159651 23886 rowset.cpp:236] Fail to delete /data_ssd1/sr_storage/data/212/134560089/220688835/02000000dc560d61554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd1/sr_storage/data/212/134560089/220688835/02000000dc560d61554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
W0711 04:44:01.159669 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/551/134540462/341682761/02000000dc5a87ca554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/551/134540462/341682761/02000000dc5a87ca554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
W0711 04:44:01.159706 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/989/96307542/1369391786/02000000dc6a7c65554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/989/96307542/1369391786/02000000dc6a7c65554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
W0711 04:44:01.159722 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc56c737554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc56c737554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
W0711 04:44:01.159726 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc60414c554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc60414c554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory
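
When there are this many "Fail to delete" warnings, it can help to see whether they cluster on a few tablets or disks. Below is a minimal Python sketch of one way to tally them. It assumes the path layout is <storage root>/data/<shard>/<tablet id>/<schema hash>/<file>.dat, which is inferred from the paths in the log above rather than confirmed here. The names FAIL_TO_DELETE and count_by_tablet are made up for this illustration, and the sample lines are copied from the warnings above with the wrapped halves joined.

```python
import re
from collections import Counter

# Assumed layout: <root>/data/<shard>/<tablet_id>/<schema_hash>/<file>.dat
# (inferred from the paths in the warnings, not from StarRocks documentation).
FAIL_TO_DELETE = re.compile(
    r"Fail to delete (?P<root>/\S+?)/data/(?P<shard>\d+)/(?P<tablet>\d+)/"
    r"(?P<schema_hash>\d+)/(?P<file>[^/:\s]+\.dat)"
)


def count_by_tablet(lines):
    """Count 'Fail to delete' warnings per (storage root, tablet id)."""
    counts = Counter()
    for line in lines:
        match = FAIL_TO_DELETE.search(line)
        if match:
            counts[(match.group("root"), int(match.group("tablet")))] += 1
    return counts


# Three warnings taken from the log excerpt above.
sample_lines = [
    "W0711 04:44:01.159669 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/551/134540462/341682761/02000000dc5a87ca554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/551/134540462/341682761/02000000dc5a87ca554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory",
    "W0711 04:44:01.159722 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc56c737554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc56c737554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory",
    "W0711 04:44:01.159726 23886 rowset.cpp:236] Fail to delete /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc60414c554a384979c46092f1dd4b9951abf182_0.dat: Not found: /data_ssd/sr_storage/data/550/134540446/341682761/02000000dc60414c554a384979c46092f1dd4b9951abf182_0.dat: No such file or directory",
]

counts = count_by_tablet(sample_lines)
for (root, tablet), n in counts.most_common():
    print(f"{root} tablet {tablet}: {n} warning(s)")
```

To run it against a real log file, pass the open file object to count_by_tablet instead of sample_lines.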



Take a look at be.out to see why this BE restarted.

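We hit this problem again, now on version 2.5.12: SQL against a primary key table timed out. After stopping the BE service on the node named in the timeout error, the data could be queried normally again.
Background: several backfill tasks were added that day, so writes may have been fairly frequent. This time the BE node had not restarted.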

mysql> select * From xxxxx limit 1;
ERROR 1064 (HY000): query timeout. backend id: 136708581


W1019 04:29:26.394521 265245 tablet_updates.cpp:1309] wait_for_version timeout(56002ms) version:4021 tablet:144791676 #version:868 [3157 3157@0 4023] pending: rowsets:867[id/seg/row/del/byte/compaction]: [1197/1/1395953/0/231.56 MB/24.44 MB],[1198/1/190/0/42.91 KB/255.96 MB],[1199/1/363/0/78.40 KB/255.92 MB],[1200/1/383/0/82.35 KB/255.92 MB],[1201/1/382/0/77.71 KB/255.92 MB],[1202/1/481/0/97.25 KB/255.91 MB],[1203/1/557/0/112.11 KB/255.89 MB],[1204/1/639/0/128.17 KB/255.87 MB],[1205/1/1032/0/198.49 KB/255.81 MB],[1206/1/1157/0/222.27 KB/255.78 MB],[1207/1/358/0/79.55 KB/255.92 MB]...,[2055/1/856/0/168.02 KB/255.84 MB],[2056/1/881/0/172.44 KB/255.83 MB],[2057/1/738/0/147.62 KB/255.86 MB],[2058/1/881/0/172.44 KB/255.83 MB],[2059/1/828/0/162.91 KB/255.84 MB],[2060/1/856/0/168.05 KB/255.84 MB],[2061/1/805/0/158.98 KB/255.84 MB],[2062/1/774/0/153.58 KB/255.85 MB],[2063/1/828/0/162.91 KB/255.84 MB]
W1019 04:29:26.394747 265245 internal_service.cpp:242] exec multi plan fragments failed, errmsg=wait_for_version timeout(56002ms) version:4021 tablet:144791676 #version:868 [3157 3157@0 4023] pending: rowsets:867[id/seg/row/del/byte/compaction]: [1197/1/1395953/0/231.56 MB/24.44 MB],[1198/1/190/0/42.91 KB/255.96 MB],[1199/1/363/0/78.40 KB/255.92 MB],[1200/1/383/0/82.35 KB/255.92 MB],[1201/1/382/0/77.71 KB/255.92 MB],[1202/1/481/0/97.25 KB/255.91 MB],[1203/1/557/0/112.11 KB/255.89 MB],[1204/1/639/0/128.17 KB/255.87 MB],[1205/1/1032/0/198.49 KB/255.81 MB],[1206/1/1157/0/222.27 KB/255.78 MB],[1207/1/358/0/79.55 KB/255.92 MB]...,[2055/1/856/0/168.02 KB/255.84 MB],[2056/1/881/0/172.44 KB/255.83 MB],[2057/1/738/0/147.62 KB/255.86 MB],[2058/1/881/0/172.44 KB/255.83 MB],[2059/1/828/0/162.91 KB/255.84 MB],[2060/1/856/0/168.05 KB/255.84 MB],[2061/1/805/0/158.98 KB/255.84 MB],[2062/1/774/0/153.58 KB/255.85 MB],[2063/1/828/0/162.91 KB/255.84 MB]
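
The wait_for_version lines are long, so here is a rough Python sketch for pulling out the leading fields. It is only an illustration for reading this log, not a StarRocks tool. The regular expression is written against the two lines above. The field names given to the bracketed triple [3157 3157@0 4023] (first, applied, applied_minor, latest) are my assumption about what those numbers mean; the log itself does not label them. Under that reading, the query is waiting for version 4021 while the tablet has only applied up to 3157, with 867 rowsets still pending.

```python
import re

# Field names for the bracketed triple are assumptions; the log does not label them.
WAIT_FOR_VERSION = re.compile(
    r"wait_for_version timeout\((?P<timeout_ms>\d+)ms\) "
    r"version:(?P<wanted>\d+) tablet:(?P<tablet>\d+) "
    r"#version:(?P<num_versions>\d+) "
    r"\[(?P<first>\d+) (?P<applied>\d+)@(?P<applied_minor>\d+) (?P<latest>\d+)\] "
    r"pending: rowsets:(?P<pending_rowsets>\d+)"
)


def parse_wait_for_version(line):
    """Return the leading numeric fields of a wait_for_version line, or None."""
    match = WAIT_FOR_VERSION.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Start of the first warning above; the per-rowset list is cut off after one entry.
sample = (
    "W1019 04:29:26.394521 265245 tablet_updates.cpp:1309] "
    "wait_for_version timeout(56002ms) version:4021 tablet:144791676 "
    "#version:868 [3157 3157@0 4023] pending: "
    "rowsets:867[id/seg/row/del/byte/compaction]: "
    "[1197/1/1395953/0/231.56 MB/24.44 MB]"
)

info = parse_wait_for_version(sample)
# Under the assumed field meaning, this is how far the tablet is behind.
backlog = info["wanted"] - info["applied"]
print(info["tablet"], info["pending_rowsets"], backlog)
```

If that reading is right, running this over be.WARNING and grouping by tablet would show which primary key tablets have the largest gap and are worth checking first.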

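This looks somewhat similar to this user's issue: "StarRocks query on a primary key table times out and one partition of the table cannot be queried: how to fix it"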

How was this problem solved?