All 4 BE nodes crashed at the same time

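[Details] Detailed description of the problem: all 4 BE nodes crashed at the same time
[Background] What operations were performed?: Stream Load / scheduled SQL
[Business impact]
[StarRocks version] e.g. 2.3.1
[Cluster size] e.g. 3 FE (3 followers) + 4 BE (FE and BE co-located)
[Machine info] vCPUs/memory/NIC, e.g. 16/64G/10GbE
Neither be.log nor be.warn shows any errors, but every be.out contains the following: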
*** Aborted at 1665999327 (unix time) try "date -d @1665999327" if you are using GNU date ***
PC: @ 0x31ad5b5 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
*** SIGSEGV (@0xfffffffffffffff8) received by PID 2450 (TID 0x7fdcdbef5700) from PID 18446744073709551608; stack trace: ***
@ 0x3fab972 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fdd62dc8630 (unknown)
@ 0x31ad5b5 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
@ 0x3176b57 starrocks::vectorized::VectorizedIfExpr<>::evaluate()
@ 0x2d2443c starrocks::vectorized::VectorizedBinaryPredicate<>::evaluate()
@ 0x2c56bfc starrocks::ExprContext::evaluate()
@ 0x24f667f starrocks::ExecNode::eval_conjuncts()
@ 0x284d69c starrocks::pipeline::Operator::eval_conjuncts_and_in_filters()
@ 0x28da1bf starrocks::pipeline::RepeatOperator::pull_chunk()
@ 0x2898b33 starrocks::pipeline::PipelineDriver::process()
@ 0x288f54c starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x21347d9 starrocks::ThreadPool::dispatch_thread()
@ 0x213038a starrocks::thread::supervise_thread()
@ 0x7fdd62dc0ea5 start_thread
@ 0x7fdd623dbb0d __clone
@ 0x0 (unknown)
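
The first line of be.out gives the crash time as a Unix timestamp. A minimal sketch for converting it, assuming GNU date is available; the helper name `crash_time` is hypothetical, and `CST-8` is a POSIX-style TZ string for UTC+8, which is only assumed to be the cluster's local zone:

```shell
# Convert a Unix timestamp to a readable time (requires GNU date).
# $1: timestamp, $2: optional TZ string (defaults to UTC).
crash_time() {
    TZ="${2:-UTC}" date -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

crash_time 1665999327          # 2022-10-17 09:35:27 (UTC)
crash_time 1665999327 CST-8    # 2022-10-17 17:35:27 (UTC+8, assumed local zone)
```

Under that assumption the crash falls within two seconds of the 17:35:29 entries in the FE log, which makes it easier to line up the BE and FE logs.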

The FE master has the following log entries:

2022-10-17 17:35:29,029 INFO (thrift-server-pool-10|140) [ThriftServerEventProcessor.deleteContext():113] delete thrift context. client: TNetworkAddress(hostname:ip, port:51812)
2022-10-17 17:35:29,261 INFO (thrift-server-pool-1|127) [ThriftServerEventProcessor.deleteContext():113] delete thrift context. client: TNetworkAddress(hostname:ip, port:42588)
2022-10-17 17:35:29,451 INFO (thrift-server-pool-28|160) [ThriftServerEventProcessor.deleteContext():113] delete thrift context. client: TNetworkAddress(hostname:ip, port:33714)
2022-10-17 18:31:58,529 INFO (thrift-server-pool-49|29209)

I didn't see any errors in the system messages log.

It might be OOM. Check with:
dmesg -T|grep -i oom
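
A slightly narrower filter can cut down on unrelated matches. This is only a sketch: the helper name `oom_lines` is hypothetical, and the patterns are the phrases the Linux kernel commonly prints when the OOM killer fires, which may differ between kernel versions.

```shell
# Filter kernel log text on stdin for typical OOM-killer messages.
# The pattern list is an assumption based on common kernel wording.
oom_lines() {
    grep -iE 'out of memory|oom-killer|killed process'
}
```

Usage would be `dmesg -T | oom_lines` (reading the kernel log may require root on some systems). No output means the kernel did not record an OOM kill in its current ring buffer.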

The process simply disappeared; I can't find any OOM record.

Take a look at syslog: /var/log/messages contains the process-related entries recorded by the operating system.
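
To narrow the syslog down to the moment of the crash, one option is to match on the timestamp prefix. A minimal sketch: the helper name `syslog_at` is hypothetical, and the default path assumes a CentOS/RHEL-style /var/log/messages.

```shell
# Print syslog lines containing a given fixed string, e.g. a timestamp prefix.
# $1: string to match, $2: log file (defaults to /var/log/messages).
syslog_at() {
    grep -F -- "$1" "${2:-/var/log/messages}"
}
```

For example, `syslog_at 'Oct 17 17:35'` would list everything logged in the minute of the crash reported in this thread.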

Upgrade to 2.3.3 first, then keep an eye on it.

The system log shows no errors. Every BE node has the same entry: Oct 17 17:35:29 VM-0-2-centos systemd-logind: Removed session 37471. The timestamp is identical on all of them.

We're still considering the upgrade.

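This is most likely a bug, but I'm not sure whether it has already been fixed. You could upgrade to 2.3.3 first and see. Alternatively, enable core dumps and capture a core file so we can analyze it.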

Would you mind adding me on WeChat so we can take a look? CHN10151

I filed a bug on GitHub: https://github.com/StarRocks/starrocks/issues/12275. I searched beforehand and didn't find a similar issue.

Patch releases only contain bug fixes, so they're very stable.

Thanks a lot. I'll look into the upgrade tomorrow.

Before starting the BE, run ulimit -c unlimited to enable core dumps. If it crashes again, we'll be able to get more information.
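
A minimal sketch of that step, to be run in the same shell that later starts the BE, since the limit is inherited by child processes. The helper name `enable_core_dumps` is hypothetical, the fallback to the hard limit is an addition for shells that are not allowed to raise the limit, and the start command in the comment assumes a standard StarRocks install layout.

```shell
# Raise the core-file size limit for this shell and everything it starts.
enable_core_dumps() {
    # Try "unlimited"; if not permitted, fall back to the hard limit.
    ulimit -S -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
    ulimit -c    # print the effective soft limit
    # On Linux, show where the kernel will write core files.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        cat /proc/sys/kernel/core_pattern
    fi
}

enable_core_dumps
# Then start the BE from this shell, e.g. (path assumed):
#   ./bin/start_be.sh --daemon
```

If the first line printed is 0, no core file will be written and the hard limit needs to be raised by an administrator first.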

Does the problem still occur after upgrading to 2.3.3?

This PR fixes the issue: https://github.com/StarRocks/starrocks/pull/13217. Please try again once the next patch release is out.
