Slow queries with GROUPING SETS

coder_coding · 2022-12-02 02:24

[Details]
SQL 1: no GROUPING SETS, GROUP BY on a single column
-- parallelism 1: 1m38s
-- parallelism 16: 1m40s
SELECT
`#days`
, count(`#distinct_id`) as cnt
, count(distinct `#distinct_id`) as d_cnt
from ruixue_bigdata_test.dwd_fact_event_login_tst100yi
group by `#days`
;
Query time is about the same at parallelism 1 and parallelism 16.

SQL 2: no GROUPING SETS, GROUP BY on two columns
-- parallelism 1: 1m34s
-- parallelism 16: 1m39s
SELECT
`#days`
, reg_appid
, count(`#distinct_id`) as cnt
, count(distinct `#distinct_id`) as d_cnt
from ruixue_bigdata_test.dwd_fact_event_login_tst100yi
group by `#days`, reg_appid
;
Query time is about the same at parallelism 1 and parallelism 16.

SQL 3: with GROUPING SETS
-- parallelism 1: 4m23s
-- parallelism 16: 4m32s
SELECT
`#days`
, reg_appid
, count(`#distinct_id`) as cnt
, count(distinct `#distinct_id`) as d_cnt
from ruixue_bigdata_test.dwd_fact_event_login_tst100yi
group by grouping sets((`#days`), (`#days`, reg_appid))
;
Query time is about the same at parallelism 1 and parallelism 16, but clearly longer than SQL 1 and SQL 2 combined (about 3m12s at parallelism 1).

The table dwd_fact_event_login_tst100yi holds 1.5 billion rows. It is not yet clear why GROUPING SETS performs so poorly.

[StarRocks version] 2.4.0
[Cluster size] 3 FE + 5 BE

LIANGCHAOHUA · 2022-12-02 06:50

Got it, we will look into this.

LIANGCHAOHUA · 2022-12-02 06:57

Please provide the profile for this SQL.

coder_coding · 2022-12-02 07:33

Attachment: grouping sets 的 profile.txt (49.4 KB)
This is the profile at parallelism 16.

coder_coding · 2022-12-02 07:35

Could you please take a look at the profile?

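LIANGCHAOHUA · 2022-12-02 08:21

From the profile, the bottleneck is indeed the aggregation that runs after the repeat step: the aggregation cardinality is very large (3 billion), so it is processed slowly. Could you have the business side try rewriting the query manually as UNION ALL for now?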
SELECT
`#days`
, NULL as reg_appid
, count(`#distinct_id`) as cnt
, count(distinct `#distinct_id`) as d_cnt
from ruixue_bigdata_test.dwd_fact_event_login_tst100yi
group by `#days`
union all
SELECT
`#days`
, reg_appid
, count(`#distinct_id`) as cnt
, count(distinct `#distinct_id`) as d_cnt
from ruixue_bigdata_test.dwd_fact_event_login_tst100yi
group by `#days`, reg_appid
;
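
To make the "repeat, then aggregate" explanation concrete, here is a minimal Python sketch of how a repeat-based GROUPING SETS plan behaves in general. It is an illustration under stated assumptions, not StarRocks code: the toy rows, the function names `repeat` and `aggregate`, and the `set_id` column are all hypothetical. Each input row is emitted once per grouping set, so two grouping sets double the aggregation input, which appears consistent with 1.5 billion table rows turning into the 3 billion mentioned above.

```python
from collections import defaultdict

# Hypothetical toy rows: (days, reg_appid, distinct_id). Not data from the thread.
ROWS = [
    (1, "a", "u1"),
    (1, "a", "u1"),
    (1, "b", "u1"),
    (1, "b", "u2"),
    (2, "a", "u3"),
]

# The two grouping sets from SQL 3, written as the key columns each one keeps.
GROUPING_SETS = [("days",), ("days", "reg_appid")]


def repeat(rows, grouping_sets):
    """Emit every row once per grouping set, nulling key columns not in that set."""
    for days, reg_appid, distinct_id in rows:
        for set_id, cols in enumerate(grouping_sets):
            yield (
                set_id,  # assumed tag that keeps the grouping sets apart
                days if "days" in cols else None,
                reg_appid if "reg_appid" in cols else None,
                distinct_id,
            )


def aggregate(expanded):
    """Single aggregation over the expanded rows: count(id), count(distinct id).

    The toy data has no NULL ids, so count(id) is simply the row count per key.
    """
    cnt = defaultdict(int)
    seen = defaultdict(set)
    for set_id, days, reg_appid, distinct_id in expanded:
        key = (set_id, days, reg_appid)
        cnt[key] += 1
        seen[key].add(distinct_id)
    return {key: (cnt[key], len(seen[key])) for key in cnt}


expanded = list(repeat(ROWS, GROUPING_SETS))
result = aggregate(expanded)

# The aggregation input is len(ROWS) * len(GROUPING_SETS) rows.
print(len(ROWS), "input rows ->", len(expanded), "rows after repeat")
for key in sorted(result, key=repr):
    print(key, result[key])
```

In this model the rows with `set_id` 0 correspond to the first branch of the UNION ALL rewrite and the rows with `set_id` 1 to the second branch.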

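coder_coding · 2022-12-02 08:46

We currently run them separately: (#days) and (#days, reg_appid) are two SQL statements executed on their own. The goal is to reduce the number of SQL statements and see whether the query can be optimized along the way. From what I found searching on Baidu, other data engines run GROUPING SETS faster than the equivalent separate queries.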

coder_coding · 2022-12-02 08:49

I just ran the UNION ALL SQL. It took 1m42s, about the same as a single standalone query. Is this a problem in the GROUPING SETS implementation?

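LIANGCHAOHUA · 2022-12-02 09:33

In this example GROUPING SETS should in theory be the faster option. R&D needs to look at how to optimize it, so the rewrite I gave you is a temporary workaround.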
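
One way to see why GROUPING SETS could be faster in principle: the table only needs to be scanned once, and because (#days) is a prefix of (#days, reg_appid), the coarser result can be rolled up from the finer one instead of being computed from repeated rows. The Python sketch below illustrates that idea only. It makes no claim about how StarRocks or any other engine implements GROUPING SETS, and the toy rows and the name `single_pass` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows: (days, reg_appid, distinct_id). Not data from the thread.
ROWS = [
    (1, "a", "u1"),
    (1, "a", "u1"),
    (1, "b", "u1"),
    (1, "b", "u2"),
    (2, "a", "u3"),
]


def single_pass(rows):
    """Aggregate the finer grouping set in one scan, then roll it up to the coarser one."""
    cnt_fine = defaultdict(int)
    seen_fine = defaultdict(set)
    for days, reg_appid, distinct_id in rows:  # the only pass over the base rows
        cnt_fine[(days, reg_appid)] += 1
        seen_fine[(days, reg_appid)].add(distinct_id)

    # Roll up (days, reg_appid) -> (days): counts add up, distinct-id sets are unioned.
    cnt_coarse = defaultdict(int)
    seen_coarse = defaultdict(set)
    for (days, reg_appid), n in cnt_fine.items():
        cnt_coarse[days] += n
        seen_coarse[days] |= seen_fine[(days, reg_appid)]

    fine = {k: (cnt_fine[k], len(seen_fine[k])) for k in cnt_fine}
    coarse = {k: (cnt_coarse[k], len(seen_coarse[k])) for k in cnt_coarse}
    return coarse, fine


coarse, fine = single_pass(ROWS)
print("by days:", coarse)
print("by days, reg_appid:", fine)
```

Note that count(distinct ...) cannot be rolled up from the finer counts alone; the sketch keeps the distinct-id sets for that reason, which is also the part that is expensive at 1.5 billion rows.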

coder_coding · 2022-12-02 10:33

I also think GROUPING SETS should be faster in theory, as it is in many data engines. Please post in the forum once there is an optimization plan. Thanks!

coder_coding · 2022-12-05 06:06

I also think GROUPING SETS should be faster in theory, as it is in many data engines. Please post in the forum once there is an optimization plan. Thanks!

coder_coding · 2023-01-09 07:43

May I ask whether the community has any plan to optimize this?

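LIANGCHAOHUA · 2023-01-09 12:01

There are many development tasks in the queue. This one still has to be scheduled, so it will not be implemented soon.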

trueeyu · 2023-01-10 03:13

We will analyze it first.

coder_coding · 2023-01-13 02:25

OK, thank you for your help.