BE abnormal restart

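To help us locate your problem faster, please provide the following information. Thank you.
[Details] Detailed problem description
[Background] What operations were performed?
[Business impact]
[Shared-data (storage-compute separation) mode] No
[StarRocks version] 3.1.17
[Cluster size] e.g. 3 FE + 7 BE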

  • Profile information
  • Parallelism: show variables like '%parallel_fragment_exec_instance_num%';
    parallel_fragment_exec_instance_num,1
  • Is the pipeline engine enabled: show variables like '%pipeline%';
    enable_pipeline_engine,true
    enable_pipeline_query_statistic,true
    max_pipeline_dop,64
    pipeline_dop,8
    pipeline_profile_level,1
    pipeline_sink_dop,0

The error output is as follows:
3.1.17 RELEASE (build 67f1293)
query_id:be7b1ab3-1ff6-11f0-a8aa-525400f79246, fragment_instance:be7b1ab3-1ff6-11f0-a8aa-525400f7926b
tracker:process consumption: 133878712480
tracker:query_pool consumption: 189860850
tracker:load consumption: 45137068
tracker:metadata consumption: 6377695240
tracker:tablet_metadata consumption: 13326672
tracker:rowset_metadata consumption: 2190951165
tracker:segment_metadata consumption: 562190093
tracker:column_metadata consumption: 3611227070
tracker:tablet_schema consumption: 2766576
tracker:segment_zonemap consumption: 482228188
tracker:short_key_index consumption: 62998339
tracker:column_zonemap_index consumption: 1417135782
tracker:ordinal_index consumption: 1147551408
tracker:bitmap_index consumption: 26080
tracker:bloom_filter_index consumption: 10763640
tracker:compaction consumption: 218392
tracker:schema_change consumption: 0
tracker:column_pool consumption: 4282625874
tracker:page_cache consumption: 38359998432
tracker:update consumption: 69002977081
tracker:chunk_allocator consumption: 2143430448
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1745380506 (unix time) try "date -d @1745380506" if you are using GNU date ***
PC: @ 0x37959fd starrocks::pipeline::Operator::eval_no_eq_join_runtime_in_filters()
*** SIGSEGV (@0x644970726f7b) received by PID 479354 (TID 0x7fd9bffab700) from PID 1886547835; stack trace: ***
@ 0x688c0c2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fdc48f10c17 os::Linux::chained_handler()
@ 0x7fdc48f18565 JVM_handle_linux_signal
@ 0x7fdc48f0d7b3 signalHandler()
@ 0x7fdc483ae630 (unknown)
@ 0x37959fd starrocks::pipeline::Operator::eval_no_eq_join_runtime_in_filters()
@ 0x37821e4 starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
@ 0x3848e29 starrocks::pipeline::PipelineDriver::process()
@ 0x3839c97 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2ed31cc starrocks::ThreadPool::dispatch_thread()
@ 0x2ecc89a starrocks::thread::supervise_thread()
@ 0x7fdc483a6ea5 start_thread
@ 0x7fdc477a7b0d __clone
@ 0x0 (unknown)
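
As a reading aid for the log above, here is a minimal Python sketch (not part of the original report) that converts a few tracker values to GiB, converts the `Aborted at` timestamp to UTC, and renders the faulting address as bytes. The helper names are hypothetical, and it assumes the tracker consumption values are byte counts. One cautious observation: the faulting address 0x644970726f7b decodes to printable ASCII, and the "from PID" value 1886547835 equals the low 32 bits of that address, so it is most likely not a real sender PID. An address made of text bytes can hint that a pointer was overwritten with string data, but this is only a heuristic and does not identify the root cause.

```python
from datetime import datetime, timezone

GIB = 1024 ** 3


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (assumes tracker values are bytes)."""
    return n_bytes / GIB


def abort_time_utc(unix_seconds: int) -> str:
    """Convert the 'Aborted at <n> (unix time)' value to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


def address_as_ascii(addr: int) -> str:
    """Render an address's little-endian bytes as text ('.' for non-printable bytes)."""
    n = max(1, (addr.bit_length() + 7) // 8)
    raw = addr.to_bytes(n, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from the log in this thread.
    trackers = [
        ("process", 133878712480),
        ("update", 69002977081),
        ("page_cache", 38359998432),
    ]
    for name, value in trackers:
        print(f"{name}: {to_gib(value):.2f} GiB")
    print("aborted at (UTC):", abort_time_utc(1745380506))
    print("fault address as text:", address_as_ascii(0x644970726F7B))
    print("low 32 bits of fault address:", 0x644970726F7B & 0xFFFFFFFF)
```

Read this way, the update and page_cache trackers account for most of the process consumption at the time of the crash.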

The error output contains a query ID. Please send the corresponding SQL.

query.txt (45.6 KB) dump_file .txt (429.0 KB) profile.txt (1020.4 KB)

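Sent. I cannot reproduce it at the moment, but this SQL shows up at every restart. Profile was not enabled when the crashes happened; the attached profile and dump were collected by running this SQL on its own with profile enabled, and that run looked normal.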

Does be.out show this same SQL every time the cluster restarts abnormally? Roughly how often do the abnormal restarts happen?

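On average once or twice a week. I checked three of the restarts and this SQL was present in all of them. It only happens occasionally during business peak hours; running the SQL on its own, or during off-peak hours, many times has never caused a restart.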
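
To check this across all restarts rather than by eye, one option is to scan be.out for each `*** Aborted at` line and record the `query_id:` line that precedes it. The following is a minimal Python sketch that assumes be.out always uses the line format shown in the excerpt above; the function name is hypothetical. Because a query ID changes on every execution, each ID found this way still has to be mapped back to its SQL text separately (for example through the FE audit log, if it is enabled).

```python
import re
from datetime import datetime, timezone

# Patterns based on the be.out excerpt in this thread (assumed to be stable).
QUERY_RE = re.compile(r"query_id:([0-9a-f-]+), fragment_instance:([0-9a-f-]+)")
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")


def crashes_in(lines):
    """Yield (crash_time_utc, query_id) for each crash found in be.out lines.

    query_id is None when no query_id line preceded the abort line.
    """
    last_query_id = None
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            last_query_id = match.group(1)
            continue
        match = ABORT_RE.search(line)
        if match:
            when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            yield when, last_query_id
            last_query_id = None


if __name__ == "__main__":
    sample = [
        "query_id:be7b1ab3-1ff6-11f0-a8aa-525400f79246, "
        "fragment_instance:be7b1ab3-1ff6-11f0-a8aa-525400f7926b",
        "tracker:process consumption: 133878712480",
        '*** Aborted at 1745380506 (unix time) try "date -d @1745380506" ***',
    ]
    for when, query_id in crashes_in(sample):
        print(when.isoformat(), query_id)
```

For a real file, the same generator can be fed an open file handle, e.g. `crashes_in(open(path, errors="replace"))`, where `path` points at the BE's be.out.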

Is there a core dump for this? If so, we can set up an online video call and debug it together.
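
If it is unclear whether the BE host would write a core file at all, the settings can be checked first. This is a minimal, Linux-only Python sketch with hypothetical function names. It reports the core size limit of the process that runs the script, which is an assumption standing in for the BE: the limit of an already running BE process is shown in `/proc/<pid>/limits` and may differ.

```python
import resource
from pathlib import Path


def describe_limit(limit: int) -> str:
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def core_dump_settings() -> dict:
    """Return the core size limits of this process and the kernel core pattern."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is not readable
    return {
        "soft_limit": describe_limit(soft),
        "hard_limit": describe_limit(hard),
        "core_pattern": pattern,
    }


if __name__ == "__main__":
    for key, value in core_dump_settings().items():
        print(f"{key}: {value}")
```

A soft limit of 0 means no core file is written for processes started with that limit.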

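This query is very complex and contains a large number of joins, and the optimizer may insert additional joins during optimization. In some cases the optimization may produce a nested loop join, and only then is this error triggered. This stack trace is produced by an NL join.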

Can a join hint be added to avoid it?

No, there isn't.

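But running this query from a client many times never causes a restart; the BE restarts only happen a few times during business peak hours. Could it be that too many concurrent queries cause the CBO to choose a counterproductive plan?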