starrocks怎么实现ck中的localQuery

求问,starrocks怎么实现ck中的localQuery呀?有明细数据去重场景,希望把待去重id作为分桶键,然后去重的时候节点内去重即可呀?

ck中的localQuery节点内计算出的去重结果数值,可以减少很多网络传输

这个就是 colocate agg

  1. 指定 colocate key (非分区表不用)
  2. group by 包含 bucket key,…

那太好了,不过我在官方文档没找到colocate agg的相关资料,老师有空能辛苦帮忙举例说明下我需要怎么才能让我的聚合查询走colocate agg吗?

【场景】我的场景是分区的bitmap聚合表,表结构如下,按block_offset分桶,需要桶内求出bitmap的基数结果,然后各桶之间的数据进行求和即可

【表结构】
CREATE TABLE bitmap_table (
dt varchar(65533) NOT NULL COMMENT “dt”,
dim1 varchar(65533) NOT NULL COMMENT “dim1”,
block_offset bigint NOT NULL COMMENT “用户id的二进制低x位”,
user_bitmap bitmap BITMAP_UNION NULL COMMENT “用户bitmap(剔除低x位后,其余位存储为bitmap)”
) ENGINE=OLAP
AGGREGATE KEY(dt,dim1,block_offset)
COMMENT “bitmap_table”
PARTITION BY (dt)
DISTRIBUTED BY HASH(block_offset)
PROPERTIES (
“replication_num” = “3”,
“in_memory” = “false”,
“enable_persistent_index” = “false”,
“replicated_storage” = “true”,
“partition_live_number” = “100”,
“compression” = “LZ4”
);

【查询】
希望能走colocate agg,保障节点中直接计算出uv结果,然后节点间进行uv的求和
select
dim1
,sum(uv) as uv
from(
select
dim1
,block_offset
,bitmap_union_count(user_bitmap) as uv
from bitmap_table
group by dim1
,block_offset
) t1
group by dim1

你好,是说colocate agg可以直接参考colocate join是吗?

哦,没注意到你这个是单表不是join。
这个你可以explain看下里面的子查询是不是已经按你的期望进行了,因为这个schema和sql看起来节点之间并没有什么东西需要交换,毕竟已经按block_offset分桶了

好的,感谢大佬,我看了下执行计划,group by block_offset后,节点间的确只有结果数据的交换了,我这再实际测试下看看

大佬,我发现在查询分区表的单个分区的时候,group by带上分桶键,的确不会进行bitamp的shuffle,会直接在节点内计算出bitmap的整形结果,然后节点间汇总,但是同样的sql,查询多个分区的时候,就会多一步shuffle,节点内会进行bitmap_union,然后节点间shuffle后进行bitmap_union_count :sob:

case如下:
【查询单个分区】
EXPLAIN
select
dt
,dim2
,sum(pv2) as pv2
,bitmap_union_count(custom_bitmap2) as custom_num2
from dm.test_lzl_bitmap_20240822_005_part
where dt in (‘20241001’)
group by dt,dim2;

======「查询计划」:
Explain String
PLAN FRAGMENT 0
OUTPUT EXPRS:2: dt | 4: dim2 | 9: sum | 10: bitmap_union_count
PARTITION: UNPARTITIONED

RESULT SINK

2:EXCHANGE

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 02
UNPARTITIONED

1:AGGREGATE (update finalize)
| output: sum(6: pv2), bitmap_union_count(8: custom_bitmap2)
| group by: 2: dt, 4: dim2
|
0:OlapScanNode
TABLE: test_lzl_bitmap_20240822_005_part
PREAGGREGATION: ON
PREDICATES: 2: dt = ‘20241001’
partitions=1/121
rollup: test_lzl_bitmap_20240822_005_part
tabletRatio=2/2
tabletList=2318946,2318948
cardinality=1179
avgRowSize=1048595.9

【查询多个分区】
EXPLAIN
select
dt
,dim2
,sum(pv2) as pv2
,bitmap_union_count(custom_bitmap2) as custom_num2
from dm.test_lzl_bitmap_20240822_005_part
where dt in (‘20241001’,‘20241002’)
group by dt,dim2
=====执行计划
Explain String
PLAN FRAGMENT 0
OUTPUT EXPRS:2: dt | 4: dim2 | 9: sum | 10: bitmap_union_count
PARTITION: UNPARTITIONED

RESULT SINK

4:EXCHANGE

PLAN FRAGMENT 1 – !!!!多了一步!!!!
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 2: dt, 4: dim2

STREAM DATA SINK
EXCHANGE ID: 04
UNPARTITIONED

3:AGGREGATE (merge finalize)
| output: sum(9: sum), bitmap_union_count(10: bitmap_union_count)
| group by: 2: dt, 4: dim2
|
2:EXCHANGE

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 02
HASH_PARTITIONED: 2: dt, 4: dim2

1:AGGREGATE (update serialize)
| STREAMING
| output: sum(6: pv2), bitmap_union_count(8: custom_bitmap2)
| group by: 2: dt, 4: dim2
|
0:OlapScanNode
TABLE: test_lzl_bitmap_20240822_005_part
PREAGGREGATION: ON
PREDICATES: 2: dt IN (‘20241001’, ‘20241002’)
partitions=2/121
rollup: test_lzl_bitmap_20240822_005_part
tabletRatio=4/4
tabletList=2318954,2318956,2318946,2318948
cardinality=2358
avgRowSize=1048595.9

大佬能再请教下如何制定colocate key吗?分区表不指定colocate key的确会无法走colocate agg

“colocate_with” = “group_name”

1赞

试了下,是可以的,感谢~