All BE nodes crashed

We have 18 BE nodes, and at some point all of them crashed. Version: 2.0.7.

be.out

be.out (214.5 KB)

be.WARNING
W0902 15:30:34.275621 9962 compaction.cpp:190] fail to execute compaction: Memory of process exceed limit. Compaction Used: 112367584377, Limit: 107643243478. Mem usage has exceed the limit of BE
W0902 15:30:34.319694 22830 data_stream_sender.cpp:147] fail to send brpc batch, error=Host is down, error_text=[E32]Fail to keep-write into Socket{id=769 fd=706 addr=10.65.23.216:8060:36452} (0x0xe050200): Broken pipe [R1][E112]Not connected to 10.65.23.216:8060 yet, server_id=769 [R2][E112]Not connected to 10.65.23.216:8060 yet, server_id=769 [R3][E112]Not connected to 10.65.23.216:8060 yet, server_id=769
W0902 15:30:34.321411 9893 data_stream_sender.cpp:147] fail to send brpc batch, error=Host is down, error_text=[E32]Fail to keep-write into Socket{id=769 fd=706 addr=10.65.23.216:8060:36452} (0x0xe050200): Broken pipe [R1][E112]Not connected to 10.65.23.216:8060 yet, server_id=769 [R2][E112]Not connected to 10.65.23.216:8060 yet, server_id=769 [R3][E112]Not connected to 10.65.23.216:8060 yet, server_id=769
W0902 15:30:34.384864 9962 compaction.cpp:74] fail to do base compaction. res=Memory limit exceeded: Memory of process exceed limit. Compaction Used: 112367584377, Limit: 107643243478. Mem usage has exceed the limit of BE, tablet=18441.1946187683.4447b8dc6be31907-2e78e2096ae0dca5, output_version=0-450169
W0902 15:30:34.384909 9962 storage_engine.cpp:661] failed to init vectorized base compaction. res=Memory limit exceeded: Memory of process exceed limit. Compaction Used: 112367584377, Limit: 107643243478. Mem usage has exceed the limit of BE, table=18441.1946187683.4447b8dc6be31907-2e78e2096ae0dca5
W0902 15:30:34.546558 10104 input_messenger.cpp:214] Fail to read from Socket{id=772 fd=781 addr=10.65.23.215:8060:46182} (0xe050800): Connection reset by peer [104]
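For reference, the byte counts in the compaction warning are easier to read once converted. The following is a minimal Python sketch, not a StarRocks tool: the helper name `parse_mem_limit` and the regular expression are assumptions made for illustration, and the input is the message text copied from the log above. It shows that the reported Used figure is roughly 104.65 GiB against a limit of roughly 100.25 GiB, about 4.40 GiB over.

```python
import re

# Message text copied from the compaction.cpp:190 warning above.
LINE = (
    "fail to execute compaction: Memory of process exceed limit. "
    "Compaction Used: 112367584377, Limit: 107643243478. "
    "Mem usage has exceed the limit of BE"
)

GIB = 1024 ** 3


def parse_mem_limit(line):
    """Return (used_bytes, limit_bytes) from a 'Used: N, Limit: M' message.

    Hypothetical helper for illustration; returns None if the pattern is absent.
    """
    match = re.search(r"Used: (\d+), Limit: (\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


used, limit = parse_mem_limit(LINE)
over = used - limit

print(f"used  = {used / GIB:.2f} GiB")
print(f"limit = {limit / GIB:.2f} GiB")
print(f"over  = {over / GIB:.2f} GiB ({over / limit:.1%} above the limit)")
```

The numbers only say that the process was above its configured limit when compaction tried to allocate; they do not say which component was holding the memory.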

What could be the cause?

Run dmesg -T | tail -10 and see what it shows.

The be.out log is full of tcmalloc frames, so it may have been an OOM.

[Thu Aug 11 16:40:17 2022] input: QEMU QEMU USB Tablet as /devices/pci0000:00/0000:00:01.2/usb1/1-1/1-1:1.0/input/input4
[Thu Aug 11 16:40:17 2022] hid-generic 0003:0627:0001.0001: input,hidraw0: USB HID v0.01 Pointer [QEMU QEMU USB Tablet] on usb-0000:00:01.2-1/input0
[Thu Aug 11 16:40:17 2022] random: crng init done
[Thu Aug 11 16:41:09 2022] AliSecGuard: loading out-of-tree module taints kernel.
[Thu Aug 11 16:41:09 2022] AliSecGuard: module verification failed: signature and/or required key missing - tainting kernel
[Thu Aug 11 16:47:50 2022] EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
[Thu Aug 11 16:47:50 2022] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
[Thu Aug 11 16:47:50 2022] EXT4-fs (vdf): mounted filesystem with ordered data mode. Opts: (null)
[Thu Aug 11 16:47:50 2022] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null)
[Thu Aug 11 16:47:50 2022] EXT4-fs (vde): mounted filesystem with ordered data mode. Opts: (null)

Judging from the output, there is no OOM message.

Those entries are too old. Try dmesg -T | grep "starrocks_be" instead.
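If grepping for the process name alone turns up nothing, a slightly broader scan for the kernel's usual OOM-killer wording can help rule an OOM kill in or out. This is a rough Python sketch under stated assumptions: `find_oom_lines` and `read_dmesg` are hypothetical helper names, and the keyword list is a guess at common kernel phrasing that may differ between kernel versions.

```python
import re
import subprocess

# Phrases the Linux kernel commonly logs around an OOM kill (assumed list,
# wording varies between kernel versions).
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(dmesg_text, process_name=None):
    """Return dmesg lines that look OOM-related or mention process_name."""
    hits = []
    for line in dmesg_text.splitlines():
        if OOM_PATTERN.search(line):
            hits.append(line)
        elif process_name and process_name in line:
            hits.append(line)
    return hits


def read_dmesg():
    """Capture `dmesg -T` output; may need root on some systems."""
    result = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Example usage on a BE host:
#   for line in find_oom_lines(read_dmesg(), "starrocks_be"):
#       print(line)
```

An empty result suggests the kernel did not kill the process, which would be consistent with the process aborting on its own rather than being OOM-killed.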

I ran that command and the output is empty.

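What was the cluster doing when the nodes went down? Judging from the stack trace in be.out, memory reached the hard limit and the resulting std::bad_alloc was not caught by any try/catch, which crashed the process. This still needs a deeper investigation.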

How much memory do the BE machines have?

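I'd suggest upgrading to the latest 2.1 release, which includes Compaction optimizations. How many disks does each BE have mounted? Are there any tables with a large number of columns? Was a large load ever issued, for example a broker load? Could you leave your contact details so we can go over this in detail?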

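After an initial investigation, it turned out that a single large query was using up all the memory. Once that query was removed, everything ran normally.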