A BE node was suddenly killed by the system: Out of memory: Kill process 14832 (jemalloc_bg_thd) score 676 or sacrifice child

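【Description】A BE node was suddenly killed by the system: Out of memory: Kill process 14832 (jemalloc_bg_thd) score 676 or sacrifice child
【Background】None
【Business impact】
【StarRocks version】2.4.3
【Cluster size】5 FE (3 followers + 2 observers) + 6 BE (deployed separately)
【Machine info】16 cores + 64 GB
【Contact】Community Group 6 - 春江
【Attachments】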

be.out

System log:

Apr  3 13:33:09 VM-0-2-centos kernel: [14831]     0 14831 25521830 11331827   31171        0             0 starrocks_be
Apr  3 13:33:09 VM-0-2-centos kernel: Out of memory: Kill process 14831 (starrocks_be) score 676 or sacrifice child
Apr  3 13:33:09 VM-0-2-centos kernel: Killed process 14831 (starrocks_be), UID 0, total-vm:102087320kB, anon-rss:45327184kB, file-rss:124kB, shmem-rss:0kB
Apr  3 13:33:09 VM-0-2-centos kernel: YDService invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr  3 13:33:09 VM-0-2-centos kernel: YDService cpuset=/ mems_allowed=0-1
Apr  3 13:33:09 VM-0-2-centos kernel: CPU: 5 PID: 13164 Comm: YDService Kdump: loaded Tainted: G        W      ------------   3.10.0-1160.66.1.el7.x86_64 #1
Apr  3 13:33:09 VM-0-2-centos kernel: Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
Apr  3 13:33:09 VM-0-2-centos kernel: Call Trace:
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa15865a9>] dump_stack+0x19/0x1b
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1581648>] dump_header+0x90/0x229
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0f06ac2>] ? ktime_get_ts64+0x52/0xf0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0f5e17f>] ? delayacct_end+0x8f/0xb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc258d>] oom_kill_process+0x2cd/0x490
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0f336f1>] ? cpuset_mems_allowed_intersects+0x21/0x30
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc2c7a>] out_of_memory+0x31a/0x500
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc9874>] __alloc_pages_nodemask+0xad4/0xbe0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa10193b8>] alloc_pages_current+0x98/0x110
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fbe037>] __page_cache_alloc+0x97/0xb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc0fe0>] filemap_fault+0x270/0x420
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffc050e756>] ext4_filemap_fault+0x36/0x50 [ext4]
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fee7aa>] __do_fault.isra.61+0x8a/0x100
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0feed5c>] do_read_fault.isra.63+0x4c/0x1b0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0ff65a0>] handle_mm_fault+0xa20/0xfb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1594653>] __do_page_fault+0x213/0x500
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1594a26>] trace_do_page_fault+0x56/0x150
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1593fa2>] do_async_page_fault+0x22/0xf0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa15907a8>] async_page_fault+0x28/0x30
Apr  3 13:33:09 VM-0-2-centos kernel: Mem-Info:

。。。


Apr  3 13:33:09 VM-0-2-centos kernel: [  441]     0   441  6945614  4513913    9168        0             0 java
Apr  3 13:33:09 VM-0-2-centos kernel: [14855]     0 14831 25521830 11342843   31171        0             0 jemalloc_bg_thd
Apr  3 13:33:09 VM-0-2-centos kernel: Out of memory: Kill process 14832 (jemalloc_bg_thd) score 676 or sacrifice child
Apr  3 13:33:09 VM-0-2-centos kernel: Killed process 14855 (jemalloc_bg_thd), UID 0, total-vm:102087320kB, anon-rss:45370244kB, file-rss:1128kB, shmem-rss:0kB
Apr  3 13:33:09 VM-0-2-centos kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr  3 13:33:09 VM-0-2-centos kernel: java cpuset=/ mems_allowed=0-1
Apr  3 13:33:09 VM-0-2-centos kernel: CPU: 9 PID: 461 Comm: java Kdump: loaded Tainted: G        W      ------------   3.10.0-1160.66.1.el7.x86_64 #1
Apr  3 13:33:09 VM-0-2-centos kernel: Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
Apr  3 13:33:09 VM-0-2-centos kernel: Call Trace:
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa15865a9>] dump_stack+0x19/0x1b
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1581648>] dump_header+0x90/0x229
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0f06ac2>] ? ktime_get_ts64+0x52/0xf0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0f5e17f>] ? delayacct_end+0x8f/0xb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc258d>] oom_kill_process+0x2cd/0x490
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc2c7a>] out_of_memory+0x31a/0x500
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc9874>] __alloc_pages_nodemask+0xad4/0xbe0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa10193b8>] alloc_pages_current+0x98/0x110
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fbe037>] __page_cache_alloc+0x97/0xb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fc0fe0>] filemap_fault+0x270/0x420
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffc050e756>] ext4_filemap_fault+0x36/0x50 [ext4]
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0fee7aa>] __do_fault.isra.61+0x8a/0x100
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0feed5c>] do_read_fault.isra.63+0x4c/0x1b0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa0ff65a0>] handle_mm_fault+0xa20/0xfb0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1594653>] __do_page_fault+0x213/0x500
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1594a26>] trace_do_page_fault+0x56/0x150
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa1593fa2>] do_async_page_fault+0x22/0xf0
Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa15907a8>] async_page_fault+0x28/0x30
Apr  3 13:33:09 VM-0-2-centos kernel: Mem-Info:
Apr  3 13:33:09 VM-0-2-centos kernel: active_anon:15888426 inactive_anon:87 isolated_anon:28#012 active_file:0 inactive_file:5890 isolated_file:153#012 unevictable:2589 dirty:34 writeback:2 unstable:0#012 slab_reclaimable:68101 slab_unreclaimable:15434#012 mapped:2442 shmem:175 pagetables:41725 bounce:0#012 free:95527 free_pcp:828 free_cma:0
Apr  3 13:33:09 VM-0-2-centos kernel: Node 0 DMA free:15892kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
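For anyone reading a dump like this, the per-process rows can be converted from pages to GiB to see who was holding the memory. The following is a minimal Python sketch, not an official tool. It assumes 4 KiB pages and the column order `[pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name`. The header row is not shown in the excerpt above, so that order is an assumption, but it agrees with the `total-vm` and `anon-rss` figures on the "Killed process" lines. The helper name `parse_oom_rows` is made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages (the x86_64 default)

# Assumed column order: [pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)\s*$"
)

# Lines copied from the system log above; the call-trace line is there
# only to show that non-table lines are skipped.
SAMPLE = [
    "Apr  3 13:33:09 VM-0-2-centos kernel: [14831]     0 14831 25521830 11331827   31171        0             0 starrocks_be",
    "Apr  3 13:33:09 VM-0-2-centos kernel: [<ffffffffa15865a9>] dump_stack+0x19/0x1b",
    "Apr  3 13:33:09 VM-0-2-centos kernel: [  441]     0   441  6945614  4513913    9168        0             0 java",
]


def parse_oom_rows(lines):
    """Extract per-process rows from an OOM dump and convert pages to kB/GiB."""
    rows = []
    for line in lines:
        m = ROW.search(line)
        if not m:
            continue  # not a process-table row
        pid, _uid, _tgid, total_vm, rss, _ptes, _swap, _adj, name = m.groups()
        rows.append({
            "pid": int(pid),
            "name": name,
            "total_vm_kb": int(total_vm) * PAGE_KB,
            "rss_gib": int(rss) * PAGE_KB / 1024 / 1024,
        })
    return rows


rows = parse_oom_rows(SAMPLE)
for r in rows:
    print(f'{r["name"]:>14} pid={r["pid"]:<6} rss={r["rss_gib"]:.1f} GiB')
```

On the two rows above this gives roughly 43.2 GiB resident for `starrocks_be` and 17.2 GiB for `java`, about 60 GiB together. That is in line with `active_anon:15888426` pages (about 60.6 GiB) in the Mem-Info line, so on a 64 GB machine almost nothing was left.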


System monitoring

Does the cluster use any UDFs?

There seem to be a few UDFs. Could the UDFs be the cause?

Possibly, but I'm not sure. Could you share your UDFs so we can take a look?

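I don't think that's it: no queries using UDFs were running at the time the BE went down.
I think I've found the cause. We scaled out last week to move FE and BE onto separate nodes, and the switchover wasn't finished yet. The node that died was the only remaining one still running both FE and BE. The FE hadn't been taken down because user clients were still connected to it, but the BE's memory limit parameter (mem_limit) had already been removed. The FE was serving very few queries by then, but it still held 16 GB of memory, so at peak load the BE hit OOM and was killed by the system.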

So you mean the BE that went down was actually co-deployed with an FE, and be.conf had no mem_limit restriction, right?

Yes, I thought that FE had already been taken offline...

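OK. If you run into OOM again, please first check whether FE and BE are co-deployed on the node and whether mem_limit has been left unset, so that the same OOM does not recur.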
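As a rough illustration of that check, here is a small Python sketch that estimates how much of a node's memory can be left to the BE when other processes share the machine. The function name `be_mem_limit_percent` and the 4 GiB reserve for the OS are assumptions made for this example, not StarRocks recommendations. It also assumes that `mem_limit` in be.conf accepts a percentage value; please confirm that against the documentation for your version.

```python
def be_mem_limit_percent(total_gib, other_gib, reserve_gib=4.0):
    """Percent of physical memory left for the BE, rounded down.

    total_gib   -- physical memory on the node
    other_gib   -- memory held by co-located processes (e.g. an FE JVM)
    reserve_gib -- headroom for the OS and page cache (assumed value)
    """
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    budget = total_gib - other_gib - reserve_gib
    if budget <= 0:
        raise ValueError("no memory left for the BE on this node")
    return int(budget / total_gib * 100)  # floor to stay on the safe side


# The mixed node in this thread: 64 GB machine, FE holding about 16 GB.
mixed = be_mem_limit_percent(64, 16)
# A dedicated BE node of the same size, with nothing else running.
dedicated = be_mem_limit_percent(64, 0)

print(f"mixed node:     mem_limit = {mixed}%")
print(f"dedicated node: mem_limit = {dedicated}%")
```

With these assumed numbers the mixed node comes out at 68%, well below what a BE could use when it has the machine to itself, which is why removing the limit on the one remaining mixed node was enough to trigger the OOM.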

It won't happen again; everything is deployed separately now. Thanks a lot!