Querying a large Paimon table: only 3 concurrent instances in the scan stage

[Details] Querying a large Paimon table is very slow. The table has roughly 2.2 billion rows and the cluster has 100 backend nodes, yet the profile shows only 3 instances running in the scan stage. How can I increase query parallelism so the whole cluster is used? Querying an Iceberg table, the scan stage runs with 100 instances, so I'm not sure whether this is a problem in the Paimon catalog implementation.
The Paimon table has 200 buckets.
Query: select count(1) from paimon.xxx.xxxx;
[StarRocks version] 3.2.4
[Cluster size] 1 FE + 100 BE
@trueeyu @yuchen1019 @Doni could you take a look? Thanks.

The execution plan is as follows:
+---------------------------------------------------------------------------------------------------------------------------------+
| Explain String |
+---------------------------------------------------------------------------------------------------------------------------------+
| PLAN FRAGMENT 0(F01) |
| Output Exprs:20: count |
| Input Partition: UNPARTITIONED |
| RESULT SINK |
| |
| 4:AGGREGATE (merge finalize) |
| | aggregate: count[([20: count, BIGINT, false]); args: TINYINT; result: BIGINT; args nullable: true; result nullable: false] |
| | cardinality: 1 |
| | column statistics: |
| | * count-->[0.0, 3.533228509E9, 0.0, 8.0, 1.0] ESTIMATE |
| | |
| 3:EXCHANGE |
| distribution type: GATHER |
| cardinality: 1 |
| |
| PLAN FRAGMENT 1(F00) |
| |
| Input Partition: RANDOM |
| OutPut Partition: UNPARTITIONED |
| OutPut Exchange Id: 03 |
| |
| 2:AGGREGATE (update serialize) |
| | aggregate: count[(1); args: TINYINT; result: BIGINT; args nullable: false; result nullable: false] |
| | cardinality: 1 |
| | column statistics: |
| | * count-->[0.0, 3.533228509E9, 0.0, 8.0, 1.0] ESTIMATE |
| | |
| 1:Project |
| | output columns: |
| | 22 <-> 1 |
| | cardinality: 3533228509 |
| | column statistics: |
| | * auto_fill_col-->[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE |
| | |
| 0:PaimonScanNode |
| TABLE: xxxxxxxxxxxx |
| partitions=1/1 |
| avgRowSize=0.0 |
| cardinality: -1 |
| column statistics: |
+---------------------------------------------------------------------------------------------------------------------------------+
40 rows in set (3.38 sec)

profile (60.8 KB)


I've roughly tracked down the cause: com.starrocks.qe.HDFSBackendSelector.HdfsScanRangeHasher#acceptScanRangeLocations does not take Paimon into account.

Inspecting a concrete Paimon THdfsScanRange shows that none of its fields hit any of the conditions in com.starrocks.qe.HDFSBackendSelector.HdfsScanRangeHasher#acceptScanRangeLocations that feed the primitiveSink, so the computed hash for every Paimon THdfsScanRange ends up being 0.
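For context, here is a simplified paraphrase of that hashing logic. This is a sketch (a lone method fragment), not the exact source; the field names full_path, partition_id, relative_path, and offset are assumptions based on the THdfsScanRange thrift struct, and the real branches may differ slightly. The key point is that only fields which are actually set get fed to the PrimitiveSink, and Paimon splits set none of them.

```java
// Simplified sketch of HdfsScanRangeHasher#acceptScanRangeLocations (not the exact source).
// Only fields that are set on THdfsScanRange are fed into Guava's PrimitiveSink; for
// Paimon scan ranges none of the branches match, so the sink receives no input at all
// and every Paimon scan range ends up with the same hash.
void acceptScanRangeLocations(TScanRangeLocations locations, PrimitiveSink sink) {
    THdfsScanRange r = locations.scan_range.hdfs_scan_range;
    if (r.isSetFull_path()) {
        sink.putBytes(r.getFull_path().getBytes(StandardCharsets.UTF_8));
    } else {
        if (r.isSetPartition_id()) {
            sink.putLong(r.getPartition_id());          // partition-based key
        }
        if (r.isSetRelative_path()) {
            sink.putBytes(r.getRelative_path().getBytes(StandardCharsets.UTF_8));
        }
    }
    if (r.isSetOffset()) {
        sink.putLong(r.getOffset());                    // distinguishes splits of one file
    }
    // A Paimon split built by PaimonScanNode#addSplitScanRangeLocations sets none of
    // these fields, so nothing above executes.
}
```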



Because all the hash values are 0, the hash ring always returns the same group of backends in com.starrocks.qe.HDFSBackendSelector#computeScanRangeAssignment.
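The consequence in computeScanRangeAssignment is easy to illustrate outside StarRocks. StarRocks uses its own hash-ring implementation rather than Guava's consistentHash, but the behaviour for identical keys is the same: every split lands on the same backend group. A standalone, purely illustrative demo:

```java
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;

// Standalone illustration (not StarRocks code): when acceptScanRangeLocations feeds
// nothing into the hasher, every scan range gets an identical hash, and any
// consistent-hash lookup then picks the same backend for all of them.
public class SameHashDemo {
    public static void main(String[] args) {
        HashFunction fn = Hashing.murmur3_128();
        int numBackends = 100;
        for (int split = 0; split < 5; split++) {
            Hasher hasher = fn.newHasher();
            // Nothing is put into the hasher, mirroring what happens for Paimon splits.
            long key = hasher.hash().asLong();
            int backend = Hashing.consistentHash(key, numBackends);
            System.out.println("split " + split + " -> hash " + key + " -> backend " + backend);
        }
        // All five splits print the same hash and therefore the same backend index.
    }
}
```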

One possible fix is to put the split's bucket information into THdfsScanRange in com.starrocks.planner.PaimonScanNode#addSplitScanRangeLocations and hash on that bucket info (sketched below).
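A sketch of that direction follows. This is hypothetical and the actual change in the PR may look different; the idea is simply to give each Paimon split a stable identifier that the hasher already knows how to consume. DataSplit#bucket(), DataSplit#dataFiles(), and DataFileMeta#fileName() are assumed Paimon APIs, and `dataSplit` stands for whatever split object addSplitScanRangeLocations iterates over.

```java
// Hypothetical sketch only; the real fix in PR 44330 may differ.
// Inside PaimonScanNode#addSplitScanRangeLocations, derive a stable key from the
// split's bucket (and optionally its data files) and put it on the scan range so
// HdfsScanRangeHasher has something to feed into the PrimitiveSink.
THdfsScanRange hdfsScanRange = new THdfsScanRange();
// ... existing population of the scan range ...

// Assumed Paimon API: DataSplit#bucket(), DataSplit#dataFiles() -> List<DataFileMeta>.
String splitKey = dataSplit.bucket() + "/"
        + dataSplit.dataFiles().stream()
                .map(f -> f.fileName())
                .collect(java.util.stream.Collectors.joining(","));

// relative_path is one of the fields the hasher already inspects, so setting it
// (or adding a dedicated bucket field to the thrift struct) spreads Paimon splits
// across the hash ring instead of pinning them all to one backend group.
hdfsScanRange.setRelative_path(splitKey);
```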

Filed an issue.

Great analysis. Hope the community can fix this soon.

Thanks for the feedback. This will be fixed together in this PR: https://github.com/StarRocks/starrocks/pull/44330
