Queries are very slow when the filter involves a UDF-derived column; large data volumes never finish

coder_coding · June 28, 2023 07:10

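[Details]
After computing a value with a UDF, filtering on that UDF-computed value makes the query very slow, while filtering on a value that does not come from the UDF is fast.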

UDF signature: check_properties(string, string)

Statements executed:

When the outermost WHERE condition is unrelated to the UDF, the query is fast, about 1.5 s, and the count is 55960132.

select count(1) from (
    select
        `#time`
        , `#event`
        , check_result
        , properties as json_sample
    from (
        select *, check_properties(
            (
                select str
                from (
                    select event, GROUP_CONCAT(concat(column_name, ':', data_type), ',') as str
                    from bi_bury_event_attribute_v
                    where table_name != 'ods_event_user' and starts_with(event, '#') = 0 and starts_with(event, 'rx') = 0
                    group by event
                ) t
                where s.`#event` = t.event
            )
            , properties) as check_result
        from ods_fact_event_track s
        where `#create_time` >= '2023-06-27 14:00:00' and `#type` = 'track'
          and `#event` in (select event from bi_bury_event_attribute_v where table_name != 'ods_event_user' and starts_with(event, '#') = 0 and starts_with(event, 'rx') = 0)
    ) t
) t
where `#event` is not null
;

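When the outermost WHERE condition involves the UDF result, the query is very slow and essentially never finishes. With a small data volume it barely completes: for a count of 829064 it takes about 28 s. The gap is far too large.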

select count(1) from (
    select
        `#time`
        , `#event`
        , check_result
        , properties as json_sample
    from (
        select *, check_properties(
            (
                select str
                from (
                    select event, GROUP_CONCAT(concat(column_name, ':', data_type), ',') as str
                    from bi_bury_event_attribute_v
                    where table_name != 'ods_event_user' and starts_with(event, '#') = 0 and starts_with(event, 'rx') = 0
                    group by event
                ) t
                where s.`#event` = t.event
            )
            , properties) as check_result
        from ods_fact_event_track s
        where `#create_time` >= '2023-06-28 14:00:00' and `#type` = 'track'
          and `#event` in (select event from bi_bury_event_attribute_v where table_name != 'ods_event_user' and starts_with(event, '#') = 0 and starts_with(event, 'rx') = 0)
    ) t
) t
where check_result is not null
;

With no filter condition at all, the behavior matches the first case: even at 55960132 rows the query takes only about 1.2 s.

[Background]
[Business impact]
[StarRocks version] 2.5.4
[Cluster size] 3 FE + 3 BE

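coder_coding · June 28, 2023 07:26

Adding the profiles for the three cases.

Profile with the outermost WHERE condition not involving the UDF:
Profile, WHERE without UDF-related column (288.5 KB)
Profile with the outermost WHERE condition involving the UDF:
Profile, WHERE with UDF-related column (305.0 KB)
Profile with no WHERE condition:
Profile, no WHERE condition (295.0 KB)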

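coder_coding · June 28, 2023 08:14

From the profiles: whenever the WHERE clause involves the UDF-derived column, BytesRead for the ods_fact_event_track table is very large (more than ten GB); when it does not, BytesRead is only a few hundred MB. The two SQL statements differ only in the filter condition.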
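
One way to check this observation is to compare the plans StarRocks prints with `EXPLAIN VERBOSE`. The sketch below is a simplified version of the two queries, not the original statements: the literal 'a:string' is a hypothetical stand-in for the schema string that the original builds with a subquery, and the backticks around the `#`-prefixed column names are assumed. A plausible reading, not confirmed anywhere in this thread, is that when the outer query never uses check_result, the optimizer drops that projection, so the properties column is not read and the UDF is not called. If so, the scan of ods_fact_event_track should include properties only in the second plan.

```sql
-- Plan A: count(1) with a filter that does not reference the UDF result.
-- check_result is never used by the outer query.
EXPLAIN VERBOSE
SELECT count(1) FROM (
    SELECT `#event`,
           check_properties('a:string', properties) AS check_result  -- 'a:string' is a placeholder
    FROM ods_fact_event_track
    WHERE `#create_time` >= '2023-06-28 14:00:00' AND `#type` = 'track'
) t
WHERE `#event` IS NOT NULL;

-- Plan B: the same query, but the filter references the UDF result,
-- so check_result has to be computed for every row.
EXPLAIN VERBOSE
SELECT count(1) FROM (
    SELECT `#event`,
           check_properties('a:string', properties) AS check_result  -- 'a:string' is a placeholder
    FROM ods_fact_event_track
    WHERE `#create_time` >= '2023-06-28 14:00:00' AND `#type` = 'track'
) t
WHERE check_result IS NOT NULL;
```

If the two plans differ in which columns the scan reads, the BytesRead gap would be explained by the properties column being read only in the second case, independent of how fast the UDF itself is.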

LIANGCHAOHUA · June 29, 2023 06:46

That depends on the performance of the UDF's implementation logic.
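
A rough way to time the UDF on its own, sketched under assumptions: count(check_result) counts only non-NULL values, so the function should have to run for every scanned row even though no WHERE condition touches its result. The literal 'a:string' is again a hypothetical placeholder for the schema string, and the backticks around the `#`-prefixed columns are assumed.

```sql
-- Forces evaluation of the UDF for every row without filtering on its result.
SELECT count(check_result) AS non_null_results,  -- rows where the UDF returned a value
       count(1)            AS total_rows          -- all rows in the time window
FROM (
    SELECT check_properties('a:string', properties) AS check_result  -- placeholder first argument
    FROM ods_fact_event_track
    WHERE `#create_time` >= '2023-06-28 14:00:00' AND `#type` = 'track'
) t;
```

Comparing this run time with a plain count(1) over the same window gives an estimate of how much of the slowdown comes from evaluating the UDF and reading its input column.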

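coder_coding · June 30, 2023 01:34

I don't quite understand. Both queries are count(*), so why does BytesRead become so large as soon as the filter is on the UDF result column?

coder_coding · June 30, 2023 01:36

With no WHERE filter, a plain count(*) is also fast.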

LIANGCHAOHUA · July 2, 2023 14:26

We would need to see the actual UDF code to know.