BE node CPU load goes extremely high, then the process dies

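[Details] When running a query over a large data volume, CPU usage on the machine sits at 1000%. After the statement stops executing, the resources are still not released until the BE service is restarted.
[Background] 3 billion rows, 34 GB, ORC storage format, data stored on OSS
[Business impact] Only small-scale computation is being run at the moment; if this goes to production, it would affect the entire business pipeline.
[StarRocks version] 2.4
[Cluster size] 1 FE + 4 BE
[Machine info] 16c128g
[Attachments]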

  • Profile information

  • Parallelism: show variables like '%parallel_fragment_exec_instance_num%';

  • Whether pipeline is enabled: show variables like '%pipeline%';

  • Execution plan: explain costs + sql

  • Screenshots of BE node CPU and memory usage

Please post be.out.

Then run dmesg -T to check whether the process was killed by the OOM killer.

It was indeed an OOM.

On this BE machine, are there any other processes besides starrocks_be that use a lot of memory?

Could you post the dmesg -T output?

What value does cat /proc/sys/vm/overcommit_memory return?
Is swap on or off?

No, only starrocks_be.

The value is currently 1.
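
For context on the two questions above: vm.overcommit_memory accepts 0 (heuristic overcommit), 1 (always overcommit) and 2 (strict accounting), and whether swap is configured can be read from the SwapTotal line of /proc/meminfo. Below is a minimal sketch for checking both on a Linux host. The helper names describe_overcommit and swap_total_kib are hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Meaning of each vm.overcommit_memory value in the Linux kernel.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit, allocations are never refused",
    2: "strict accounting, limited by CommitLimit",
}


def describe_overcommit(raw):
    """Translate the raw file content (e.g. '1\\n') into a description."""
    value = int(raw.strip())
    return OVERCOMMIT_MODES.get(value, f"unknown mode {value}")


def swap_total_kib(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    oc = Path("/proc/sys/vm/overcommit_memory")
    mi = Path("/proc/meminfo")
    if oc.exists() and mi.exists():  # only meaningful on Linux
        print("overcommit_memory:", describe_overcommit(oc.read_text()))
        swap = swap_total_kib(mi.read_text())
        if swap is None:
            print("swap: unknown")
        else:
            print("swap:", "off" if swap == 0 else f"{swap} kB configured")
```

With the value 1 reported here, the kernel does not refuse allocations up front, so memory exhaustion would typically show up as an OOM kill rather than as a failed allocation inside the process.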

I'll reproduce the problem and collect the logs.

Just post the dmesg -T output; there is no need to reproduce it.

a.log (72.2 KB)

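How much memory does free -g show on the machine?

Roll back to 2.3.4 for now.

Could you add me on WeChat so we can discuss this? It may be related to jemalloc being enabled in 2.4.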

[Tue Nov  1 16:14:49 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Nov  1 16:14:49 2022] [    556]     0   556    40786      110   360448        0             0 systemd-journal
[Tue Nov  1 16:14:49 2022] [    586]     0   586    11426      170   118784        0         -1000 systemd-udevd
[Tue Nov  1 16:14:49 2022] [    701]    81   701    15085      163   167936        0          -900 dbus-daemon
[Tue Nov  1 16:14:49 2022] [    733]    32   733    17317      134   180224        0             0 rpcbind
[Tue Nov  1 16:14:49 2022] [    734]     0   734    48805      119   163840        0             0 gssproxy
[Tue Nov  1 16:14:49 2022] [    737]     0   737     6706      198   102400        0             0 systemd-logind
[Tue Nov  1 16:14:49 2022] [    741]     0   741    22666      219   212992        0             0 rngd
[Tue Nov  1 16:14:49 2022] [    744]   999   744   153582     2132   270336        0             0 polkitd
[Tue Nov  1 16:14:49 2022] [    763]   997   763    29455      131   139264        0             0 chronyd
[Tue Nov  1 16:14:49 2022] [    764]     0   764     6480       51    98304        0             0 atd
[Tue Nov  1 16:14:49 2022] [    766]     0   766    31600      156   102400        0             0 crond
[Tue Nov  1 16:14:49 2022] [    802]     0   802    27555       33    61440        0             0 agetty
[Tue Nov  1 16:14:49 2022] [    803]     0   803    27555       33    65536        0             0 agetty
[Tue Nov  1 16:14:49 2022] [   1040]     0  1040    25750      515   221184        0             0 dhclient
[Tue Nov  1 16:14:49 2022] [   1105]     0  1105   143511     2776   417792        0             0 tuned
[Tue Nov  1 16:14:49 2022] [   1125]     0  1125   184516      466   835584        0             0 rsyslogd
[Tue Nov  1 16:14:49 2022] [   1432]     0  1432    10583      380   106496        0             0 AliYunDunUpdate
[Tue Nov  1 16:14:49 2022] [   1625]     0  1625    38941     5188   327680        0             0 AliYunDun
[Tue Nov  1 16:14:49 2022] [   1852]     0  1852    28249      259   253952        0         -1000 sshd
[Tue Nov  1 16:14:49 2022] [   3218]  1001  3218   522815    48816  1585152        0             0 clickhouse-keep
[Tue Nov  1 16:14:49 2022] [   3459]  1001  3459   139329     7351   483328        0             0 clckhouse-watch
[Tue Nov  1 16:14:49 2022] [   3475]  1001  3475  7105902   125365 46682112        0             0 clickhouse-serv
[Tue Nov  1 16:14:49 2022] [   3587]     0  3587     7378      356    86016        0             0 ilogtail
[Tue Nov  1 16:14:49 2022] [   3589]     0  3589    70239      945   184320        0             0 ilogtail
[Tue Nov  1 16:14:49 2022] [   3593]     0  3593     7379      356    90112        0             0 ilogtail
[Tue Nov  1 16:14:49 2022] [   3594]     0  3594    75364     6074   225280        0             0 ilogtail
[Tue Nov  1 16:14:49 2022] [   4392]  1003  4392  1316580    99733  1228800        0             0 java
[Tue Nov  1 16:14:49 2022] [   4430]  1003  4430   782384    29313  1966080        0             0 doris_be
[Tue Nov  1 16:14:49 2022] [   6320]     0  6320     6900       93    94208        0             0 argusagent
[Tue Nov  1 16:14:49 2022] [   6322]     0  6322   375537     6510   401408        0             0 /usr/local/clou
[Tue Nov  1 16:14:49 2022] [  24494]  1000 24494   556571     6570   389120        0             0 taihao_exporter
[Tue Nov  1 16:14:49 2022] [   4671]  1000  4671     5402      467    81920        0             0 taihao-proxy
[Tue Nov  1 16:14:49 2022] [  19908]     0 19908   203958     1501    90112        0             0 aliyun-service
[Tue Nov  1 16:14:49 2022] [  20010]     0 20010     4454       80    61440        0             0 assist_daemon
[Tue Nov  1 16:14:49 2022] [  24520]  1002 24520  1543891   141153  1974272        0             0 java
[Tue Nov  1 16:14:49 2022] [    564]  1002   564 36766367 31086645 262057984        0             0 starrocks_be
[Tue Nov  1 16:14:49 2022] [  11488]     0 11488    60877      298   323584        0             0 sudo
[Tue Nov  1 16:14:49 2022] [  11489]     0 11489    37546      123   139264        0             0 lsof
[Tue Nov  1 16:14:49 2022] [  11491]  1000 11491    40743       82   159744        0             0 sudo
[Tue Nov  1 16:14:49 2022] [  11492]  1000 11492    31305      692    86016        0             0 python
[Tue Nov  1 16:14:49 2022] [  11494]  1000 11494    30985      374    81920        0             0 python
[Tue Nov  1 16:14:49 2022] [  11496]  1000 11496     5402      467    81920        0             0 tokio-runtime-w
[Tue Nov  1 16:14:49 2022] [  11497]     0 11497     2149       15    61440        0             0 systemd-cgroups
[Tue Nov  1 16:14:49 2022] Out of memory: Kill process 564 (starrocks_be) score 961 or sacrifice child
[Tue Nov  1 16:14:49 2022] Killed process 564 (starrocks_be) total-vm:147065468kB, anon-rss:124346580kB, file-rss:0kB, shmem-rss:0kB
[Tue Nov  1 16:14:55 2022] oom_reaper: reaped process 564 (starrocks_be), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
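
A note on reading the process table above: the rss column is counted in pages, not kB. The sketch below assumes a 4 KiB page size (typical on x86_64, but not stated in this thread) and uses a hypothetical helper, rss_by_pid, to convert three rows copied from the log into GiB.

```python
import re

PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages; not stated in the thread

# Three rows copied verbatim from the dmesg output above.
SAMPLE = """\
[Tue Nov  1 16:14:49 2022] [   3475]  1001  3475  7105902   125365 46682112        0             0 clickhouse-serv
[Tue Nov  1 16:14:49 2022] [   4430]  1003  4430   782384    29313  1966080        0             0 doris_be
[Tue Nov  1 16:14:49 2022] [    564]  1002   564 36766367 31086645 262057984        0             0 starrocks_be
"""

# Columns: [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+\d+\s+\d+\s+\d+\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+-?\d+\s+(?P<name>.+)$"
)


def rss_by_pid(text):
    """Return {pid: (name, rss_kib)} for every task row found in text."""
    result = {}
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            rss_kib = int(m.group("rss")) * PAGE_SIZE_KIB
            result[int(m.group("pid"))] = (m.group("name").strip(), rss_kib)
    return result


# Print the rows sorted by resident memory, largest first.
for pid, (name, kib) in sorted(rss_by_pid(SAMPLE).items(), key=lambda kv: -kv[1][1]):
    print(f"{pid:>6} {name:<16} {kib / (1024 * 1024):8.2f} GiB")
```

Under that assumption, the starrocks_be row works out to 31086645 × 4 = 124346580 kB, which equals the anon-rss value in the "Killed process" line. That is roughly 118.6 GiB resident on a 128 GB machine, while the other processes in the sample are each well under 1 GiB.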

Have you changed any settings in be.conf?

Here is the BE configuration:

default_rowset_type=beta
trash_file_expire_time_sec=86400
push_write_mbytes_per_sec=20
sys_log_level=INFO
sys_log_verbose_modules=
storage_root_path=/mnt/disk1/starrocks_manual/storage;/mnt/disk2/starrocks_manual/storage;/mnt/disk3/starrocks_manual/storage;/mnt/disk4/starrocks_manual/storage;
priority_networks=10.0.0.0
brpc_port=8060
heartbeat_service_port=9050
write_buffer_size=258
be_port=9060
webserver_port=18040
sys_log_roll_num=10
sys_log_roll_mode=SIZE-MB-1024

Please run free -g.

We hit this problem earlier on 2.3.2 and were told it had been fixed in 2.4.