【BE OOM】 After setting mem_limit to 85%, the BE process on two nodes was killed by the system OOM killer

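To help us locate your problem faster, please provide the following information. Thank you.
【Details】The BEs are configured with mem_limit = 85%, but the BE process on two nodes was killed by the system because of OOM, which affects business stability.
【Background】Routine usage; no unusual operations were performed.
【Business impact】Stability
【Shared-data (storage-compute separation)】No
【StarRocks version】2.5.22
【Cluster size】3 FE (1 follower + 2 observers) + 11 BE (FE and BE deployed on separate machines)
【Machine info】vCPUs/memory/NIC: 32C/128G/10 GbE
【Contact】So that we can reach you promptly for log information while resolving the issue, please add your contact details: Community group 12, 金谡, jinsu@moojing.com. Thank you.
【Attachments】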

  • fe.log/be.INFO/relevant screenshots
    be.info log:
    W1023 13:51:12.163887 988231 pipeline_driver.cpp:632] fragment_id c7c9fc78-9102-11ef-acea-52540029b191 driver query_id=c7c9fc78-9102-11ef-acea-52540029b16c fragment_id=c7c9fc78-9102-11ef-acea-52540029b191 driver=driver_6_13, status=INPUT_EMPTY, operator-chain: [exchange_source_6_0x7fa9c0914490(O) -> hash_join_build_7_0x7fa9c0914e90(X)(HashJoiner=0x7fb468582310)] cancels operator hash_join_build_7_0x7fa9c0914e90(X)(HashJoiner=0x7fb468582310) with finished error runtime state is cancelled
    W1023 13:51:12.165146 988231 pipeline_driver.cpp:632] fragment_id c7c9fc78-9102-11ef-acea-52540029b191 driver query_id=c7c9fc78-9102-11ef-acea-52540029b16c fragment_id=c7c9fc78-9102-11ef-acea-52540029b191 driver=driver_6_2, status=INPUT_EMPTY, operator-chain: [exchange_source_6_0x7fae3ff01e10(O) -> hash_join_build_7_0x7fae447cd910(X)(HashJoiner=0x7fb4f6e0e410)] cancels operator hash_join_build_7_0x7fae447cd910(X)(HashJoiner=0x7fb4f6e0e410) with finished error runtime state is cancelled
    W1023 13:51:12.169059 988231 pipeline_driver.cpp:632] fragment_id c7c9fc78-9102-11ef-acea-52540029b191 driver query_id=c7c9fc78-9102-11ef-acea-52540029b16c fragment_id=c7c9fc78-9102-11ef-acea-52540029b191 driver=driver_6_0, status=INPUT_EMPTY, operator-chain: [exchange_source_6_0x7fae3e217b90(O) -> hash_join_build_7_0x7fae3e218590(X)(HashJoiner=0x7fb4f6e0d210)] cancels operator hash_join_build_7_0x7fae3e218590(X)(HashJoiner=0x7fb4f6e0d210) with finished error runtime state is cancelled
    W1023 13:51:12.178531 988231 pipeline_driver.cpp:632] fragment_id c7c9fc78-9102-11ef-acea-52540029b191 driver query_id=c7c9fc78-9102-11ef-acea-52540029b16c fragment_id=c7c9fc78-9102-11ef-acea-52540029b191 driver=driver_6_15, status=INPUT_EMPTY, operator-chain: [exchange_source_6_0x7fb0002ef690(O) -> hash_join_build_7_0x7fb0002f0090(X)(HashJoiner=0x7fb468582910)] cancels operator hash_join_build_7_0x7fb0002f0090(X)(HashJoiner=0x7fb468582910) with finished error runtime state is cancelled
    W1023 13:51:12.185879 988231 pipeline_driver.cpp:632] fragment_id c7c9fc78-9102-11ef-acea-52540029b191 driver query_id=c7c9fc78-9102-11ef-acea-52540029b16c fragment_id=c7c9fc78-9102-11ef-acea-52540029b191 driver=driver_6_12, status=INPUT_EMPTY, operator-chain: [exchange_source_6_0x7fa9c0912b90(O) -> hash_join_build_7_0x7fa9c0913590(X)(HashJoiner=0x7fb4b2382a10)] cancels operator hash_join_build_7_0x7fa9c0913590(X)(HashJoiner=0x7fb4b2382a10) with finished error runtime state is cancelled

dmesg log:
root@sr-clst2-be-2521-6:/data/log/starrocks# dmesg -T|grep -i oom
[Thu Oct 10 14:17:27 2024] pip_wg_executor invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[Thu Oct 10 14:17:27 2024] oom_kill_process.cold+0xb/0x10
[Thu Oct 10 14:17:27 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Thu Oct 10 14:17:27 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-76652.scope,task=starrocks_be,pid=874634,uid=0
[Thu Oct 10 14:17:27 2024] Out of memory: Killed process 874634 (starrocks_be) total-vm:1926000692kB, anon-rss:123440280kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:2666096kB oom_score_adj:0
[Thu Oct 10 14:17:32 2024] oom_reaper: reaped process 874634 (starrocks_be), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Wed Oct 23 13:50:49 2024] pip_wg_executor invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[Wed Oct 23 13:50:49 2024] oom_kill_process.cold+0xb/0x10
[Wed Oct 23 13:50:49 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Oct 23 13:50:49 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-103109.scope,task=starrocks_be,pid=987888,uid=0
[Wed Oct 23 13:50:49 2024] Out of memory: Killed process 987888 (starrocks_be) total-vm:1418698488kB, anon-rss:122831052kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:2083968kB oom_score_adj:0
[Wed Oct 23 13:50:54 2024] oom_reaper: reaped process 987888 (starrocks_be), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
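
To put the dmesg numbers next to the configured limit, here is a rough Python sketch that extracts the two "Killed process" records above and converts anon-rss to GiB. It assumes the machine has a nominal 128 GiB and that mem_limit = 85% is taken against that figure. The real MemTotal is somewhat lower, so treat the limit as approximate. The function names (parse_oom_kills, limit_gib) are made up for illustration and are not part of StarRocks.

```python
import re

# Matches the kernel's "Killed process" line as it appears in the dmesg output above.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)

KIB_PER_GIB = 1024 ** 2


def parse_oom_kills(dmesg_text):
    """Return one dict per OOM kill found in the dmesg text."""
    kills = []
    for m in KILL_RE.finditer(dmesg_text):
        kills.append({
            "pid": int(m["pid"]),
            "name": m["name"],
            "anon_rss_gib": int(m["anon_rss_kb"]) / KIB_PER_GIB,
        })
    return kills


def limit_gib(total_gib=128.0, mem_limit_pct=85.0):
    """Assumed limit: mem_limit percentage of a nominal 128 GiB machine."""
    return total_gib * mem_limit_pct / 100.0


# The two kill records copied from the dmesg output in this post.
SAMPLE = (
    "[Thu Oct 10 14:17:27 2024] Out of memory: Killed process 874634 (starrocks_be) "
    "total-vm:1926000692kB, anon-rss:123440280kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:0 pgtables:2666096kB oom_score_adj:0\n"
    "[Wed Oct 23 13:50:49 2024] Out of memory: Killed process 987888 (starrocks_be) "
    "total-vm:1418698488kB, anon-rss:122831052kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:0 pgtables:2083968kB oom_score_adj:0\n"
)

for kill in parse_oom_kills(SAMPLE):
    over = kill["anon_rss_gib"] - limit_gib()
    print(f"pid {kill['pid']} ({kill['name']}): "
          f"anon-rss {kill['anon_rss_gib']:.2f} GiB, "
          f"{over:.2f} GiB above the assumed {limit_gib():.1f} GiB limit")
```

Under these assumptions, both kills happened with anon-rss around 117 GiB, roughly 8 to 9 GiB above 85% of 128 GiB (108.8 GiB). So the process's resident memory had grown past the configured mem_limit before the kernel stepped in. The logs here do not show which component accounts for the difference.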

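  • Slow query:
    • Profile information
    • Degree of parallelism: show variables like '%parallel_fragment_exec_instance_num%';
    • Whether pipeline is enabled: show variables like '%pipeline%';
    • Screenshots of CPU and memory usage on the BE nodes
  • Query error:
  • BE crash
  • External table query error
    • be.out and fe.warn.log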

dmesg.log (261.5 KB) be.info.log.zip (953.3 KB)