Exact deduplication on non-int fields is really hard; neither materialized views nor bitmaps work well

U_1655714884510_0366 · 2022-07-22 09:28

Background:
I need an exact distinct count on a field whose values look like this.
Sample values:
asjlfjaojflsfjjiworiovnN2=SAF+SFJLAJFowefjwfjOJO123+SNK
asfj-fasl-sasw-salf-aslj-asdf-2wnr-nvia

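Data volume: 1 billion rows

Distinct values after deduplication: 100 million

Approaches tried:
1. Build a global dictionary, load the encoded data into StarRocks, and create an aggregate-model table whose column is of type BITMAP with BITMAP_UNION aggregation.
Even so, the query takes nearly 1 s.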

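Profile analysis:
For the count(distinct string) query, the plan already resolves to the bitmap_union_count() function underneath, so this is presumably already the most efficient path?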
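
For readers following along, approach 1 might look roughly like the sketch below. All table and column names (`id_dict`, `events_raw`, `events_agg`, `raw_id`, `dict_id`, `id_set`, `dt`) are hypothetical and are not taken from the poster's ddl.txt. It assumes the global dictionary already exists as a table that maps each distinct string to a unique integer.

```sql
-- Hypothetical names throughout; this is a sketch, not the poster's actual DDL.
-- Assumed to exist already:
--   id_dict(raw_id VARCHAR, dict_id BIGINT)   one row per distinct string
--   events_raw(dt DATE, raw_id VARCHAR)       the raw duplicate-model data

-- Aggregate-model table that stores the encoded ids as a bitmap per day.
CREATE TABLE events_agg (
    dt     DATE   NOT NULL,
    id_set BITMAP BITMAP_UNION NOT NULL
)
AGGREGATE KEY (dt)
DISTRIBUTED BY HASH (dt) BUCKETS 8;

-- Encode each string through the dictionary, then put the integer into the bitmap.
INSERT INTO events_agg
SELECT r.dt, TO_BITMAP(d.dict_id)
FROM events_raw r
JOIN id_dict d ON r.raw_id = d.raw_id;

-- Exact distinct count over the whole table.
SELECT BITMAP_UNION_COUNT(id_set) FROM events_agg;
```

Because every string maps to exactly one integer, the bitmap count stays exact; the cost moves into building and maintaining the dictionary.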

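2. Create a duplicate table with the column as varchar; a direct count(distinct string) takes more than 6 s.
But I don't know how to build a materialized view that does exact deduplication here.
bitmap_union(to_bitmap()) can't be used, since the values are not integers.
If I build a bitmap-type materialized view via bitmap_hash or something similar, the result is inaccurate: the hash has collisions, so the count isn't exact.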
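
One way to see how far a hash-based bitmap drifts from the exact answer is to run both counts against the duplicate table. This is a sketch with hypothetical names (`events_raw`, `raw_id`). It assumes `bitmap_hash` produces a 32-bit hash, in which case roughly 100 million distinct strings would be expected to lose on the order of 1% of the count to collisions.

```sql
-- Hypothetical table and column names; a sketch for comparison only.

-- Exact answer: slow on a large varchar column.
SELECT COUNT(DISTINCT raw_id) AS exact_cnt
FROM events_raw;

-- Hash-based answer: faster, but colliding strings are counted once,
-- so this can come out lower than exact_cnt.
SELECT BITMAP_UNION_COUNT(BITMAP_HASH(raw_id)) AS hashed_cnt
FROM events_raw;
```

The gap between the two numbers is the undercount caused by collisions, which is why this route only suits approximate counting.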

Does anyone in the community have ideas? Many thanks!

许秀不许秀 · 2022-07-23 06:21

Please post a profile.

U_1655714884510_0366 · 2022-07-25 02:55

profile.txt (170.1 KB)

U_1655714884510_0366 · 2022-07-25 02:56

I've added the profile. Please take a look, thanks.

许秀不许秀 · 2022-07-28 06:22

set new_planner_agg_stage=4;
Then run select count(distinct ...) on the string column directly, see how long it takes, and post another profile.

U_1655714884510_0366 · 2022-07-28 09:06

profile2.txt (173.4 KB)

U_1655714884510_0366 · 2022-07-28 09:10

After applying the set, the query time is about the same, around 800 ms.

U_1655714884510_0366 · 2022-07-28 09:10

CREATE TABLE statement:
ddl.txt (2.0 KB)

许秀不许秀 · 2022-07-28 10:28

With that set applied, how long does a direct count distinct on the duplicate table take?

U_1655714884510_0366 · 2022-07-29 05:28

15s
duplicateprofile.txt (346.7 KB)

许秀不许秀 · 2022-07-29 09:09

You could consider increasing the parallelism.
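
The reply does not name a specific setting. In StarRocks releases from around the time of this thread, query parallelism is typically raised through session variables, so a sketch might look like the following. It assumes `parallel_fragment_exec_instance_num` for the non-pipeline engine, or `pipeline_dop` if the pipeline engine is enabled. The value 8 is only an illustration and should be sized to the CPU cores available on each BE. `events_raw` and `raw_id` are hypothetical names.

```sql
-- Assumption: non-pipeline execution engine; 8 is an illustrative value.
SET parallel_fragment_exec_instance_num = 8;

-- Assumption: if the pipeline engine is enabled, this variable applies instead.
-- SET pipeline_dop = 8;

-- Re-run the slow query on the duplicate table and compare timings and profiles.
SELECT COUNT(DISTINCT raw_id) FROM events_raw;
```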