SQL triggers a BE core dump

SELECT
    base_query.*
FROM
    (
        SELECT
            count(distinct xxx621_column_8cRm0F7LlVYKmR0BOw7E) as i_v_xxx621_column_8cRm0F7LlVYKmR0BOw7E_1705645931677,
            IF(
                count(distinct xxx621_column_8cRm0F7LlVYKmR0BOw7E) > 400,
                count(distinct xxx621_column_8cRm0F7LlVYKmR0BOw7E) - 400,
                0
            ) as i_v_column88975_1705646119869,
            sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 3 then 1 else 0 end) as i_v_column88975_1705647714467,
            IF(
                sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 3 then 1 else 0 end) < 40,
                40 - sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 3 then 1 else 0 end),
                0
            ) as i_v_column88975_1705653314373,
            sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 2 then 1 else 0 end) as i_v_column88975_1705647914706,
            sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 1 then 1 else 0 end) as i_v_column88975_1705647925854,
            sum(case when xxx621_column_m3IUm0DqsDvPW3CvBZWb = 4 then 1 else 0 end) as i_v_column88975_1705651769346,
            xxx62789_e7345f93cd4081345351b83af60344e1.xxx621_column_9yuJpqD4RY71YXztBUCC as xxx621_column_9yuJpqD4RY71YXztBUCC,
            grouping(xxx621_column_9yuJpqD4RY71YXztBUCC) as xxx621_column_9yuJpqD4RY71YXztBUCC_total_flag
        from
            xxx62789_e7345f93cd4081345351b83af60344e1
        group by
            grouping sets (
                (),
                (xxx621_column_9yuJpqD4RY71YXztBUCC)
            )
    ) AS base_query
order by
    xxx621_column_9yuJpqD4RY71YXztBUCC_total_flag desc,
    base_query.xxx621_column_9yuJpqD4RY71YXztBUCC asc
LIMIT 6000 OFFSET 0

xxx62789_e7345f93cd4081345351b83af60344e1 is a CTE (a join of multiple tables).
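
For orientation only, a purely hypothetical sketch of the shape such a CTE might have; the real definition was not shared in the thread, and fact_tbl, dim_tbl and join_key below are placeholder names invented for illustration:

-- Hypothetical placeholder, not the real definition: the column names come from the
-- query above, but the tables and join key are invented for illustration.
WITH xxx62789_e7345f93cd4081345351b83af60344e1 AS (
    SELECT
        f.xxx621_column_8cRm0F7LlVYKmR0BOw7E,
        f.xxx621_column_m3IUm0DqsDvPW3CvBZWb,
        d.xxx621_column_9yuJpqD4RY71YXztBUCC
    FROM fact_tbl f
    JOIN dim_tbl d ON f.join_key = d.join_key
)
-- ...followed by the SELECT shown above, which reads from this CTE.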

This statement causes a BE core dump. Why? Contents of the be.out log file:

*** Aborted at 1705967445 (unix time) try "date -d @1705967445" if you are using GNU date ***
PC: @ 0x2d57320 starrocks::vectorized::FixedLengthColumnBase<>::append_selective()
*** SIGSEGV (@0x1000) received by PID 14905 (TID 0x7ffb0cafb700) from PID 4096; stack trace: ***
@ 0x5b97b22 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7ffbd8c54630 (unknown)
@ 0x2d57320 starrocks::vectorized::FixedLengthColumnBase<>::append_selective()
@ 0x50cb458 starrocks::vectorized::NullableColumn::append_selective()
@ 0x50ae6ca starrocks::vectorized::Chunk::append_selective()
@ 0x324dfbe starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
@ 0x324e897 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
@ 0x2d906c0 starrocks::pipeline::PipelineDriver::process()
@ 0x51add6a starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4b968f2 starrocks::ThreadPool::dispatch_thread()
@ 0x4b9138a starrocks::thread::supervise_thread()
@ 0x7ffbd8c4cea5 start_thread
@ 0x7ffbd8267b0d __clone
@ 0x0 (unknown)
start time: Tue Jan 23 07:53:36 CST 2024

The symptom is that after this statement runs, the system load climbs to over 1000.

StarRocks version: 2.5.10

Ignore the load of 1000+; that is unrelated to this issue.

It still feels like some bug in SR is causing the crash.

You could try upgrading to the latest 2.5 patch release; this looks like an already-fixed issue. Run EXPLAIN COSTS on this SQL and send us the execution plan so we can confirm.
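
For reference, EXPLAIN COSTS is just prefixed to the statement; a minimal sketch against the same table is shown below (the full reproducer SELECT can be substituted for the body):

-- Prefix the problem statement with EXPLAIN COSTS to get the cost-annotated plan.
EXPLAIN COSTS
SELECT
    count(distinct xxx621_column_8cRm0F7LlVYKmR0BOw7E),
    grouping(xxx621_column_9yuJpqD4RY71YXztBUCC)
FROM xxx62789_e7345f93cd4081345351b83af60344e1
GROUP BY GROUPING SETS ((), (xxx621_column_9yuJpqD4RY71YXztBUCC));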

Already sent it to you privately.

Search be.warn for "large mem".

Searched; there is no "large mem" entry.

The same problem also exists in 2.5.13.

The latest 2.5 patch release is 2.5.20.

Try set enable_rewrite_groupingsets_to_union_all=true; and see whether the crash still occurs.
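
A quick way to apply and check the variable; SET, SET GLOBAL, and SHOW VARIABLES are standard StarRocks syntax, and the global form is shown only as an option:

-- Apply the workaround in the current session only.
SET enable_rewrite_groupingsets_to_union_all = true;
-- Verify the current value.
SHOW VARIABLES LIKE 'enable_rewrite_groupingsets_to_union_all';
-- Optionally apply it for all new sessions.
SET GLOBAL enable_rewrite_groupingsets_to_union_all = true;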

Is this a test environment or a production environment?

After using set enable_rewrite_groupingsets_to_union_all=true; it no longer crashes. Why is that?

We can reproduce it in both our production and test environments.

In our earlier tests we also found that rewriting GROUPING SETS to UNION ALL works, but not every GROUPING SETS statement fails; the crash seems to be triggered only when certain conditions are met... One statement reproduces the crash every time.
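
To illustrate what that rewrite does, here is a minimal sketch with simplified placeholder names (t for the CTE, a for the distinct-counted column, c for the grouped column). GROUPING SETS ((), (c)) produces both the grand-total row and the per-c rows from a single aggregation; with enable_rewrite_groupingsets_to_union_all=true the optimizer instead plans the equivalent UNION ALL below, which presumably takes a different execution path and so side-steps the crash seen here:

-- Grand-total grouping set (): c is rolled up, so grouping(c) would be 1.
SELECT count(distinct a) AS metric, NULL AS c, 1 AS c_total_flag
FROM t
UNION ALL
-- Per-value grouping set (c): grouping(c) would be 0.
SELECT count(distinct a) AS metric, c, 0 AS c_total_flag
FROM t
GROUP BY c
ORDER BY c_total_flag DESC, c ASC;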

How about I find a debug build and we reproduce the issue in your test environment? That would make it easier to investigate.

Or first upgrade your test environment to 2.5.20 and see whether it still reproduces.

Upgrading is not easy on our side; it is controlled by the ops team, they move slowly, and many people are using the cluster...

Could you send the EXPLAIN COSTS output for the SQL, taken with set enable_rewrite_groupingsets_to_union_all=false;?

Shall we add each other on WeChat to discuss the details? lxhhust350@qq.com

Yes, added.