【BE】BE crash

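【Details】The BE crashed, a coredump process appeared at the same time, and the cluster became inaccessible.
【Background】Tracing the FE log shows that a SQL statement was being executed at the time; the BE then logged the exception below.
【Business impact】The cluster is inaccessible.
【StarRocks version】e.g. 2.4.0
【Cluster size】e.g. 3 FE (1 follower + 2 observers) + 8 BE (FE and BE deployed separately)
【Machine info】CPU vCores / memory / NIC, e.g. 48C/64G/10 GbE
【Contact】So that we can reach you promptly for log information while resolving the issue, please add your contact details, e.g. community group 12 - 金谡 jinsu@moojing.com. Thanks.
【Attachments】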

  • fe.log / be.INFO / relevant screenshots
    be.out log
    start time: Tue 03 Jan 2023 02:30:03 PM CST
    query_id:52684183-d918-11ed-92dd-525400e8ff46, fragment_instance:52684183-d918-11ed-92dd-525400e8ff48
    *** Aborted at 1681293404 (unix time) try "date -d @1681293404" if you are using GNU date ***
    PC: @ 0x2f35d3f starrocks::vectorized::HashJoiner::_has_null()
    *** SIGSEGV (@0x0) received by PID 555045 (TID 0x7f944a3ad700) from PID 0; stack trace: ***
    @ 0x481e332 google::(anonymous namespace)::FailureSignalHandler()
    @ 0x7f950f8a73ab os::Linux::chained_handler()
    @ 0x7f950f8abefc JVM_handle_linux_signal
    @ 0x7f950f89ed48 signalHandler()
    @ 0x7f950ef72420 (unknown)
    @ 0x2f35d3f starrocks::vectorized::HashJoiner::_has_null()
    @ 0x2ec31ea starrocks::pipeline::HashJoinBuildOperator::set_finishing()
    @ 0x2e76ff7 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @ 0x2e77099 starrocks::pipeline::PipelineDriver::_mark_operator_finished()
    @ 0x2e77629 starrocks::pipeline::PipelineDriver::_mark_operator_cancelled()
    @ 0x2e779b2 starrocks::pipeline::PipelineDriver::_check_fragment_is_canceled()
    @ 0x2e77dd0 starrocks::pipeline::PipelineDriver::process()
    @ 0x2e6e5a3 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @ 0x2680a05 starrocks::ThreadPool::dispatch_thread()
    @ 0x267bf2a starrocks::thread::supervise_thread()
    @ 0x7f950ef66609 start_thread
    @ 0x7f950ed2c133 clone
    @ 0x0 (unknown)
    start time: Wed 12 Apr 2023 06:02:58 PM CST

FE audit_log
2023-04-12 17:56:44,220 [query] |Client=10.19.189.0:17481|User=root|AuthorizedUser='root'@'%'|ResourceGroup=default_wg|Catalog=default_catalog|Db=eanalyze|State=ERR|ErrorCode=CANCELLED|Time=3884|ScanBytes=0|ScanRows=0|ReturnRows=0|CpuCostNs=0|MemCostBytes=0|StmtId=486536|QueryId=52684183-d918-11ed-92dd-525400e8ff46|IsQuery=true|feIp=10.19.134.135|Stmt=select t1.item_id, t1.cat1, date_format(t1.time, '%Y-%m') online_time from item2 t1 where time='2020-03-01' and cat1='33' and item_id not in (select item_id from test_online_time_tmall )|Digest=|PlanCpuCost=9.706265408775068E9|PlanMemCost=9.706253497E9
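
The abort line in be.out carries a Unix timestamp, and converting it makes it easy to check that the crash lines up with the cancelled query in the audit log. The snippet below is a minimal sketch of that check. It assumes GNU date (as the log line itself suggests) and that "CST" in these logs means China Standard Time, UTC+8, which is an inference from the log timestamps rather than something stated in the post. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Abort line taken from the be.out excerpt in this post (trailing hint trimmed).
abort_line='*** Aborted at 1681293404 (unix time) ***'

# Pull the epoch seconds out of the glog abort line.
epoch=$(printf '%s\n' "$abort_line" \
  | sed -n 's/.*Aborted at \([0-9][0-9]*\) (unix time).*/\1/p')

# Assumption: the cluster logs in UTC+8. The POSIX TZ string "CST-8" means
# "a zone labelled CST, 8 hours east of UTC" and does not need tzdata.
crash_time=$(TZ=CST-8 date -d "@$epoch" '+%Y-%m-%d %H:%M:%S')

echo "$crash_time"   # prints 2023-04-12 17:56:44
```

The result matches the 2023-04-12 17:56:44 audit log entry for QueryId 52684183-d918-11ed-92dd-525400e8ff46, the same query_id printed at the top of the be.out crash record.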

  • Slow query:
    • Profile information
    • Parallelism: show variables like '%parallel_fragment_exec_instance_num%';
    • Whether pipeline is enabled: show variables like '%pipeline%';
    • Screenshots of BE node CPU and memory usage
  • Query error:
  • BE crash
    • be.out

Received. We'll start investigating.

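Please run ulimit -c unlimited on the BE and then restart the BE. The core file is usually written under be/lib and named core.xxxx. Next, run the SQL that triggers the problem; once the BE crashes, send us the core file. Finally, turn core dumps off again with ulimit -c 0. Thanks!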
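
The steps above can be sketched as a few shell helpers. This is only a rough outline under stated assumptions: BE_HOME and its default path are hypothetical and must be pointed at the real BE install directory, the BE is assumed to be managed with the bin/start_be.sh and bin/stop_be.sh scripts shipped with StarRocks, and the core file is assumed to land in be/lib as described (the actual location depends on the kernel's core_pattern setting). Note that ulimit only affects the current shell and the processes it starts, so the BE has to be restarted from the same shell.

```shell
#!/usr/bin/env bash
# Assumption: BE_HOME is the BE install directory; this default is hypothetical.
BE_HOME="${BE_HOME:-/opt/starrocks/be}"

# Step 1: allow core files of any size for this shell and its children.
enable_core_dumps() {
  ulimit -c unlimited 2>/dev/null \
    || echo "could not raise the core limit (hard limit too low?)" >&2
  echo "core file size limit: $(ulimit -c)"
}

# Step 2: restart the BE from this same shell so it inherits the limit.
restart_be() {
  if [ -x "$BE_HOME/bin/stop_be.sh" ] && [ -x "$BE_HOME/bin/start_be.sh" ]; then
    "$BE_HOME/bin/stop_be.sh"
    "$BE_HOME/bin/start_be.sh" --daemon
  else
    echo "BE scripts not found under $BE_HOME; restart the BE manually" >&2
    return 1
  fi
}

# Step 3 (after re-running the SQL and the BE crashing): look for the core file.
list_core_files() {
  ls -lh "$BE_HOME"/lib/core.* 2>/dev/null \
    || echo "no core.* files under $BE_HOME/lib"
}

# Step 4: switch core dumps off again.
disable_core_dumps() {
  ulimit -c 0
  echo "core file size limit: $(ulimit -c)"
}
```

A typical order would be enable_core_dumps, then restart_be, then re-run the failing SQL, then list_core_files, and finally disable_core_dumps. Core files from a BE with a large memory footprint can be very big, so it is worth checking free disk space before step 1.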