pipeline_poller thread stays at 100% CPU while the database is running

U_1670518188057_5444 · 2023-01-31 15:07

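[Details]
(1) Writes and queries on the database suddenly hung. On inspection, the pipeline_poller thread was pinned at 100% CPU. Restarting the BE node brought the database back, but the problem recurred N hours later.
(2) After running "SET global enable_pipeline_engine = false; set global parallel_fragment_exec_instance_num = 8;", reads and writes worked normally, but the pipeline_poller thread still ends up pinned at 100% CPU.
(3) While the pipeline_poller thread was at 100% CPU, there were no business queries, only data writes on a five-minute cycle.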

[Business impact]
The database hangs; neither reads nor writes can proceed.

[StarRocks version]
2.4.1
[Cluster size]
8-node cluster: 3 FE (1 follower + 2 observers) + 5 BE

[Machine information]
Huawei Cloud: General Computing-plus | c7.8xlarge.2 | 32 vCPUs | 64 GiB

[1 - top output]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17983 root 20 0 76.7g 11.4g 27304 R 99.9 18.8 1702:41 pipeline_poller
18339 root 20 0 76.7g 11.4g 27304 S 2.0 18.8 29:27.35 cumulat_compact
18341 root 20 0 76.7g 11.4g 27304 S 2.0 18.8 28:55.84 cumulat_compact
18351 root 20 0 76.7g 11.4g 27304 S 2.0 18.8 28:57.12 cumulat_compact
18333 root 20 0 76.7g 11.4g 27304 S 1.7 18.8 29:41.16 cumulat_compact
18334 root 20 0 76.7g 11.4g 27304 S 1.7 18.8 28:47.82 cumulat_compact
18338 root 20 0 76.7g 11.4g 27304 S 1.7 18.8 28:54.17 cumulat_compact
Note: the %MEM of the pipeline_poller thread was observed to keep growing.
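For anyone trying to reproduce the listing above, here is a minimal sketch of pulling the busy threads out of a per-thread `top` snapshot. The helper name `hot_threads` is made up for this illustration, and the column positions assume the default `top` layout shown above (`%CPU` in column 9, `COMMAND` in column 12). Note that `%MEM`, `VIRT` and `RES` are per-process figures that every thread of the BE reports identically, so the growth in the note reflects the whole BE process rather than this one thread.

```shell
# hot_threads MIN_CPU
# Reads `top -H -b -n 1` output on stdin and prints "TID COMMAND %CPU"
# for every thread whose %CPU is at or above MIN_CPU.
# Assumes the default top column order (hypothetical helper, not a StarRocks tool).
hot_threads() {
  awk -v min="$1" '$1 ~ /^[0-9]+$/ && $9 + 0 >= min + 0 { print $1, $12, $9 }'
}
```

A possible invocation, assuming the BE process ID is already in `$BE_PID`: `top -H -b -n 1 -p "$BE_PID" | hot_threads 90`.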

[2 - strace output]
When I first traced with strace, it still printed the following output in a loop. A day later there was no output at all.
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({tv_sec=10, tv_nsec=0}, 0x7ffe91de0420) = 0
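A thread that spins purely in user space makes no system calls, so silence from strace is consistent with a busy loop. The 10-second `nanosleep` wrapped in SIGCHLD handling above looks like what a `sleep()` call produces, so it may have come from a different thread or process than pipeline_poller; that is only a guess. One way to check where the CPU time goes is to sample the thread's user and system tick counters from `/proc`. The sketch below assumes Linux's documented `stat` layout (utime is field 14, stime is field 15); `thread_ticks` is a name invented for this example.

```shell
# thread_ticks
# Reads one /proc/<pid>/task/<tid>/stat line on stdin and prints "utime stime"
# (clock ticks spent in user mode and in kernel mode).
# The sed call drops "pid (comm) " first, because comm may contain spaces.
thread_ticks() {
  sed 's/^.*) //' | awk '{ print $12, $13 }'
}
```

Running `thread_ticks < /proc/$BE_PID/task/$TID/stat` twice a few seconds apart (with `$BE_PID` and `$TID` filled in by hand) shows which counter is moving. If only utime grows, attaching with `strace -p "$TID"` will stay quiet, and a stack dump from a debugger is likely to be more informative.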

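[3 - Logs]
On one occasion, while the pipeline_poller thread was at 100% CPU, the log was flooded with the following message. It recovered after a restart.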
pipeline_driver_poller.cpp:63] [Driver] Timeout, query_id=6beafddd-9617-11ed-97a1-fa163ee89bcf
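To see whether the flood comes from one stuck query or from many, the timeout lines can be grouped by `query_id`. This is a rough sketch that relies only on the message format quoted above; `driver_timeouts` is a hypothetical helper name, and the log file location is left to the reader.

```shell
# driver_timeouts
# Reads BE log text on stdin and prints "<count> <query_id>" for every
# query_id that appears in a "[Driver] Timeout" line, most frequent first.
driver_timeouts() {
  grep -o 'Timeout, query_id=[0-9a-f-]*' | cut -d= -f2 | sort | uniq -c | sort -rn
}
```

Usage would be along the lines of `driver_timeouts < /path/to/be/log/file`, with the path replaced by wherever the BE writes its log.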

Could you help work out what is causing this, and is there any way to prevent it from happening?

trueeyu · 2023-02-01 02:00

Are you using resource groups?

U_1670518188057_5444 · 2023-03-08 10:37

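No resource groups are in use. Developers later helped analyze the problem: it was a compatibility issue caused by the configure options of the GCC 11 I used when building from source.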

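"The gcc11 was built with --with-default-libstdcxx-abi=gcc4-compatible. We currently have a piece of code that is not very compatible with gcc4, so we suggest rebuilding gcc11 with --with-default-libstdcxx-abi=new, and then rebuilding starrocks with the newly built gcc11."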
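Before rebuilding, it may help to confirm how the local compiler was configured. `gcc -v` prints its "Configured with:" line on stderr, and the sketch below simply extracts the ABI option from that text. `libstdcxx_abi` is a made-up helper name; when the option is absent it prints "not set" rather than guessing the default.

```shell
# libstdcxx_abi
# Reads `gcc -v` output on stdin and prints the value passed to
# --with-default-libstdcxx-abi at configure time, or "not set" if absent.
libstdcxx_abi() {
  abi=$(grep -o -e '--with-default-libstdcxx-abi=[^ ]*' | head -n 1 | cut -d= -f2) || true
  echo "${abi:-not set}"
}
```

Typical use is `gcc -v 2>&1 | libstdcxx_abi`. A complementary check is `echo '#include <string>' | g++ -x c++ -dM -E - | grep _GLIBCXX_USE_CXX11_ABI`, which should report the macro as 0 on a gcc4-compatible toolchain and 1 on one using the new ABI.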

After rebuilding, everything went back to normal.