BE crashes frequently when being brought online/offline (scaling out/in)

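StarRocks 3.3.0, shared-nothing (integrated storage and compute) version
Cluster size: 3 FE + 13 BE (not co-located)
Problem:
When decommissioning a BE, the BE goes down after migrating part of its data.
When the BE is added back, it goes down again after part of the data has been migrated in. Below is the be.info log: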

E1012 05:25:48.759169 1898528 daemon.cpp:243] got signal: Terminated from pid: -1330257056, is going to exit
I1012 05:25:48.760102 1898528 starrocks_be.cpp:293] BE exit step 1: wait exec engine tasks finish successfully
I1012 05:25:48.761581 1898528 starrocks_be.cpp:298] BE exit step 2: heartbeat server exit successfully
I1012 05:25:48.762310 1898528 server.cpp:1129] Server[starrocks::LakeServiceImpl+starrocks::BackendInternalServiceImpl<starrocks::PInternalService>+starrocks::BackendInternalServiceImpl<doris::PBackendService>] is going to quit
I1012 05:25:49.628741 1898528 starrocks_be.cpp:306] BE exit step 3: daemon threads exit successfully
I1012 05:26:04.536669 1899060 starlet.cc:103] Empty starmanager address, skip reporting!
I1012 05:26:07.934809 1898957 runtime_filter_worker.cpp:956] RuntimeFilterWorker going to exit.
I1012 05:26:08.825393 1898959 profile_report_worker.cpp:124] ProfileReportWorker going to exit.
I1012 05:26:09.831934 1898960 result_buffer_mgr.cpp:173] result buffer manager cancel thread finish.
I1012 05:26:11.151703 1898528 starrocks_be.cpp:309] BE exit step 4: exec engine destroy successfully
I1012 05:26:13.044173 1898528 starrocks_be.cpp:312] BE exit step 5: storage engine exit successfully
I1012 05:26:13.146310 1898528 staros_worker.cpp:400] Executing starlet shutdown hooks …
I1012 05:26:13.146342 1898528 starrocks_be.cpp:316] BE exit step 6: staros worker exit successfully
I1012 05:26:13.855295 1898528 starrocks_be.cpp:322] BE exit step 7: datacache shutdown successfully
I1012 05:26:13.860474 1898528 starrocks_be.cpp:328] BE exit step 8: http server exit successfully
I1012 05:26:13.862561 1898528 starrocks_be.cpp:332] BE exit step 9: brpc server exit successfully
I1012 05:26:13.862758 1898528 starrocks_be.cpp:336] BE exit step 10: thrift server exit successfully
I1012 05:26:15.691627 1898682 fragment_mgr.cpp:584] FragmentMgr cancel worker is going to exit.
I1012 05:26:15.830742 1898528 starrocks_be.cpp:339] BE exit step 11: exec env destroy successfully
I1012 05:26:15.833289 1898528 starrocks_be.cpp:347] BE exit step 12: global env stop successfully
I1012 05:26:15.833302 1898528 starrocks_be.cpp:351] BE exited successfully
I1012 05:26:15.842368 1898528 priority_thread_pool.cpp:110] join threads for scheduler
I1012 05:26:15.847146 1898528 priority_thread_pool.cpp:114] join threads for scheduler successful
I1012 05:26:15.847829 1898528 lru_container.cpp:423] Successfully prune container, clean 0 entries.
I1012 05:26:15.847875 1898528 lru_container.cpp:423] Successfully prune container, clean 0 entries.

Please post the be.out log as well.

It looks like some process sent a kill signal to the BE.
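
To check whether other BEs were stopped the same way, the "got signal" lines can be pulled out of the log. This is only a rough sketch in Python: the regular expression is written against the single line shown in the log above, and the function name `find_signal_events` is made up for illustration. Note that the sender pid in the log above is negative, so it may not identify the real sender.

```python
import re

# Pattern written against the one "got signal" line in the log above.
# It is an assumption that other BE versions use the same wording.
SIGNAL_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] "
    r"got signal: (?P<signal>\w+) from pid: (?P<pid>-?\d+)"
)


def find_signal_events(lines):
    """Return one dict per log line that reports a received signal."""
    events = []
    for line in lines:
        m = SIGNAL_LINE.match(line.strip())
        if m:
            events.append(
                {
                    "date": m["date"],
                    "time": m["time"],
                    "signal": m["signal"],
                    "sender_pid": int(m["pid"]),
                }
            )
    return events
```

A signal name of Terminated (SIGTERM) followed by the orderly "BE exit step" lines points to an external stop request rather than a crash inside the BE.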

@许秀不许秀 @trueeyu
Thanks for the replies. The problem has been found: an SSH session killed the BE process.

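Hi, we have run into a similar problem. How did the SSH session kill the BE? Did someone kill the BE manually, or is there some other logic that kills it? Could you explain in detail? Many thanks.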

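I have not looked into the exact logic, but my impression is that the BE was killed when the SSH session timed out. That is, if you start StarRocks within a session and that session later times out, the BE gets killed along with it. Leaving the session normally with exit solves the problem.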
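
A different approach from the exit fix above is to start the process in its own session, so that it is not attached to the SSH terminal at all. The Python sketch below is an illustration under assumptions, not something verified on this cluster: `start_detached` is a hypothetical helper and the paths in the usage note are placeholders. If the host is configured to kill all of a user's processes on logout, a separate session alone may not be enough.

```python
import subprocess


def start_detached(cmd, log_path):
    """Start cmd in a new session, with output appended to log_path.

    start_new_session=True makes the child call setsid() before exec,
    so it no longer belongs to the launching terminal's session.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # do not hold the terminal's stdin
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
    return proc
```

Usage would look like `start_detached(["/path/to/be/bin/start_be.sh", "--daemon"], "/tmp/start_be.log")`, where both paths are placeholders for your own installation.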