BE process reports Reached timeout=150000ms

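【Details】Stream Load jobs fail with a BE timeout error: type:LOAD_RUN_FAIL; msg:Cancelled, msg: [E1008]Reached timeout=150000ms @x.x.x.x:8060
I checked the log on one of the BEs around the time of the failure and found a large number of errors like the following: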
W0619 12:15:04.863848 7382 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863968 7382 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863976 7382 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863880 7473 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863888 7448 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863904 7442 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.864007 7442 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863925 7458 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863934 7439 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863941 7459 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.864037 7459 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863950 7432 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.864061 7432 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
W0619 12:15:04.863870 7450 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22
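I counted the distinct threads in these log lines (the third column is the thread ID) and found 64 in total, which matches my brpc_num_threads=64 setting. So my guess is that all of the BE process's brpc threads were blocked, and that this is what caused the BE node to time out? Also, just before this flood of errors, there are some errors like the following: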
W0619 12:15:01.214861 6354 async_delta_writer.cpp:36] Fail to write or commit. txn_id: 16915972 tablet_id: 46962392: Cancelled: cancel
E0619 12:15:01.214887 6354 async_delta_writer.cpp:56] Fail to write or commit. txn_id: 16915972 tablet_id: 46962401: Cancelled: cancel
W0619 12:15:01.214895 6354 async_delta_writer.cpp:36] Fail to write or commit. txn_id: 16915972 tablet_id: 46962401: Cancelled: cancel
E0619 12:15:01.214941 6354 async_delta_writer.cpp:56] Fail to write or commit. txn_id: 16915973 tablet_id: 46962392: Cancelled: cancel
W0619 12:15:01.214949 6354 async_delta_writer.cpp:36] Fail to write or commit. txn_id: 16915973 tablet_id: 46962392: Cancelled: cancel
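The thread count mentioned above (the third column of each log line) can be reproduced with a short script. This is only a minimal sketch: the function name `count_threads` and the sample list are illustrative, and the regular expression assumes the log prefix layout seen in the lines quoted in this post.

```python
import re
from collections import Counter

# Prefix layout assumed from the quoted lines:
# <level><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>]
LOG_PREFIX = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+(\d+)\s+\S+:\d+\]")


def count_threads(lines, needle="Fail to execution_queue_execute"):
    """Count how many matching log lines each thread ID produced."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue  # skip unrelated log lines
        match = LOG_PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the post, used here as sample input.
SAMPLE = [
    "W0619 12:15:04.863848 7382 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22",
    "W0619 12:15:04.863968 7382 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22",
    "W0619 12:15:04.863880 7473 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22",
    "W0619 12:15:01.214861 6354 async_delta_writer.cpp:36] Fail to write or commit. txn_id: 16915972 tablet_id: 46962392: Cancelled: cancel",
]

counts = count_threads(SAMPLE)
print(f"distinct threads: {len(counts)}")
```

To run it against a real BE log, pass an open file object instead of `SAMPLE`; the number of distinct threads can then be compared with the configured brpc_num_threads.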
These errors look like transaction commits failing, which then triggered something? And that is what causes the flood of W0619 12:15:04.864037 7459 async_delta_writer.cpp:144] Fail to execution_queue_execute: 22 log lines?
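As a small aid for reading that log line: if the trailing 22 is a standard errno value (an assumption, since the post does not say what the number means), it can be decoded with the Python standard library.

```python
import errno
import os

# Assumption: the "22" at the end of the log line is a standard errno value.
code = 22
name = errno.errorcode.get(code, "UNKNOWN")  # symbolic name, e.g. EINVAL
message = os.strerror(code)                  # platform-specific description

print(f"{code} -> {name}: {message}")
```

Under that assumption the code maps to EINVAL, which would suggest the call was rejected as invalid rather than timing out by itself.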
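I also looked at the description of this issue ( StreamLoad import jobs frequently fail with error Message : [E1008]Reached timeout=120000ms @x.x.x.3:8060 ). It looks very similar to what I am seeing, but the suggested fix does not seem to apply, because my FE memory appears to be sufficient. The FE memory trend chart is below:
[Image: FE memory trend chart]
Each BE holds about 30,000 tablets, 180,000 in total across the 6 BEs.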
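【Background】The problem occurs with high probability during Stream Load imports at the scale of tens of millions of records
【Business impact】While the BE timeout errors are occurring, both queries and writes fail with BE timeout
【StarRocks version】2.5.6
【Cluster size】3 FE (1 follower + 2 observer) + 6 BE (FE and BE co-located)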

Any follow-up on this? I've run into the same situation and don't know where to look next.

We'll take a look.
But about those 180,000 tablets: does every import touch that many tablets?

We hit the same problem on version 3.1.1. We don't know the cause, but restarting all BE nodes brings things back.

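Was this ever resolved? We have also hit it twice on 3.2.9. Restarting the FE did not help; it likewise recovered only after restarting all BEs.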