A self-join on 6 billion rows blows past the memory limit, while a left join on the same 6 billion rows returns in milliseconds

【Details】Detailed description of the problem
sql:
SELECT a.device_id, b.device_id
FROM test_db.app_log_new AS a
JOIN test_db.app_log_new AS b
ON a.device_id = b.device_id;

Error: Memory of process exceed limit. try consume:4294967296 Used: 54051743888, Limit: 56522069360. Mem usage has exceed the limit of BE
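
Decoding the numbers in the error (a quick sanity check; the three figures are copied verbatim from the message): the BE process had only about 2.3 GiB of headroom left when the join asked for another 4 GiB.

```python
# Decode the figures in the BE error message (taken verbatim from the log):
# the process was already close to its memory limit, and the extra ~4 GiB
# allocation for the join hash table pushed it over.
try_consume = 4294967296   # bytes the query tried to allocate
used = 54051743888         # bytes the BE process already had in use
limit = 56522069360        # BE process memory limit

GiB = 2 ** 30
print(f"tried to allocate {try_consume / GiB:.1f} GiB")   # 4.0 GiB
print(f"used {used / GiB:.1f} of {limit / GiB:.1f} GiB")  # 50.3 of 52.6 GiB
print(f"headroom {(limit - used) / GiB:.1f} GiB")         # only 2.3 GiB left
```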

explain:
PLAN FRAGMENT 0
OUTPUT EXPRS:1: device_id | 65: device_id
PARTITION: UNPARTITIONED

RESULT SINK

5:EXCHANGE

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 1: device_id

STREAM DATA SINK
EXCHANGE ID: 05
UNPARTITIONED

4:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 1: device_id = 65: device_id
|
|----3:EXCHANGE
|
1:EXCHANGE

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 03
HASH_PARTITIONED: 65: device_id

2:OlapScanNode
TABLE: app_log_new
PREAGGREGATION: ON
PREDICATES: 65: device_id IS NOT NULL
partitions=1/1
rollup: app_log_new
tabletRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
cardinality=6189418607
avgRowSize=0.564775
numNodes=0

PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 01
HASH_PARTITIONED: 1: device_id

0:OlapScanNode
TABLE: app_log_new
PREAGGREGATION: ON
PREDICATES: 1: device_id IS NOT NULL
partitions=1/1
rollup: app_log_new
tabletRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
cardinality=6189418607
avgRowSize=0.564775
numNodes=0

==========================================================================================================================================================================

sql:
SELECT a.device_id, b.device_id
FROM test_db.app_log_new AS a
LEFT JOIN test_db.app_log_new AS b
ON a.device_id = b.device_id;

Executes successfully and returns results in milliseconds

explain:
PLAN FRAGMENT 0
OUTPUT EXPRS:1: device_id | 65: device_id
PARTITION: UNPARTITIONED

RESULT SINK

5:EXCHANGE

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 1: device_id

STREAM DATA SINK
EXCHANGE ID: 05
UNPARTITIONED

4:HASH JOIN
| join op: LEFT OUTER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 1: device_id = 65: device_id
|
|----3:EXCHANGE
|
1:EXCHANGE

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 03
HASH_PARTITIONED: 65: device_id

2:OlapScanNode
TABLE: app_log_new
PREAGGREGATION: ON
partitions=1/1
rollup: app_log_new
tabletRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
cardinality=6189418607
avgRowSize=0.564775
numNodes=0

PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 01
HASH_PARTITIONED: 1: device_id

0:OlapScanNode
TABLE: app_log_new
PREAGGREGATION: ON
partitions=1/1
rollup: app_log_new
tabletRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
cardinality=6189418607
avgRowSize=0.564775
numNodes=0
【Background】What operations were performed?
Noticed that on the same table with 6 billion rows, a left join returns results almost instantly, while a plain join runs out of memory.
【Business impact】
【StarRocks version】2.5.5
【Cluster size】1 FE (8 cores, 32 GB RAM) + 12 BEs (16 cores, 64 GB RAM)
【Machine info】Alibaba Cloud instances
【Contact】Community group 14 - zwk, WeChat: wenkang1451

Please post the execution plans for both queries, using explain costs:

sql:
explain costs SELECT a.device_id, b.device_id
FROM test_db.app_log_new AS a
JOIN test_db.app_log_new AS b
ON a.device_id = b.device_id;

PLAN FRAGMENT 0(F05)
Output Exprs:1: device_id | 65: device_id
Input Partition: UNPARTITIONED
RESULT SINK

5:EXCHANGE
cardinality: 13551079834693250

PLAN FRAGMENT 1(F04)

Input Partition: HASH_PARTITIONED: 1: device_id
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 05

4:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| equal join conjunct: [1: device_id, VARCHAR, true] = [65: device_id, VARCHAR, true]
| cardinality: 13551079834693250
| column statistics:
| * device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE
| * device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE
|
|----3:EXCHANGE
| cardinality: 6189418607
|
1:EXCHANGE
cardinality: 6189418607

PLAN FRAGMENT 2(F02)

Input Partition: RANDOM
OutPut Partition: HASH_PARTITIONED: 65: device_id
OutPut Exchange Id: 03

2:OlapScanNode
table: app_log_new, rollup: app_log_new
preAggregation: on
Predicates: 65: device_id IS NOT NULL
partitionsRatio=1/1, tabletsRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
actualRows=6189418607, avgRowSize=0.564775
cardinality: 6189418607
column statistics:
* device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE

PLAN FRAGMENT 3(F00)

Input Partition: RANDOM
OutPut Partition: HASH_PARTITIONED: 1: device_id
OutPut Exchange Id: 01

0:OlapScanNode
table: app_log_new, rollup: app_log_new
preAggregation: on
Predicates: 1: device_id IS NOT NULL
partitionsRatio=1/1, tabletsRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
actualRows=6189418607, avgRowSize=0.564775
cardinality: 6189418607
column statistics:
* device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE
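
The eye-catching cardinality: 13551079834693250 on the HASH JOIN appears to be the standard equi-join estimate rows(a) × rows(b) / NDV. Assuming the trailing 2827.0 in the column-statistics line is the optimizer's distinct-value count for device_id (implausibly low for 6.2 billion rows, which would suggest stale or sampled statistics), the numbers line up:

```python
# Reproduce the optimizer's join-cardinality estimate from the plan figures.
# Assumption: the last field of the column-statistics line (2827.0) is the NDV.
rows = 6189418607             # scan cardinality of each side, from the plan
ndv = 2827.0                  # distinct device_id values per the statistics
estimate = rows * rows / ndv  # classic equi-join cardinality estimate
print(f"{estimate:.6e}")      # ~1.355108e+16, matching the plan's 13551079834693250
```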

sql:
explain costs SELECT a.device_id, b.device_id
FROM test_db.app_log_new AS a
LEFT JOIN test_db.app_log_new AS b
ON a.device_id = b.device_id;

PLAN FRAGMENT 0(F05)
Output Exprs:1: device_id | 65: device_id
Input Partition: UNPARTITIONED
RESULT SINK

5:EXCHANGE
cardinality: 13551079834693250

PLAN FRAGMENT 1(F04)

Input Partition: HASH_PARTITIONED: 1: device_id
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 05

4:HASH JOIN
| join op: LEFT OUTER JOIN (PARTITIONED)
| equal join conjunct: [1: device_id, VARCHAR, true] = [65: device_id, VARCHAR, true]
| cardinality: 13551079834693250
| column statistics:
| * device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE
| * device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE
|
|----3:EXCHANGE
| cardinality: 6189418607
|
1:EXCHANGE
cardinality: 6189418607

PLAN FRAGMENT 2(F02)

Input Partition: RANDOM
OutPut Partition: HASH_PARTITIONED: 65: device_id
OutPut Exchange Id: 03

2:OlapScanNode
table: app_log_new, rollup: app_log_new
preAggregation: on
partitionsRatio=1/1, tabletsRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
actualRows=6189418607, avgRowSize=0.564775
cardinality: 6189418607
column statistics:
* device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE

PLAN FRAGMENT 3(F00)

Input Partition: RANDOM
OutPut Partition: HASH_PARTITIONED: 1: device_id
OutPut Exchange Id: 01

0:OlapScanNode
table: app_log_new, rollup: app_log_new
preAggregation: on
partitionsRatio=1/1, tabletsRatio=50/50
tabletList=115244,115248,115252,115256,115260,115264,115268,115272,115276,115280 …
actualRows=6189418607, avgRowSize=0.564775
cardinality: 6189418607
column statistics:
* device_id->[-Infinity, Infinity, 0.0, 0.564775, 2827.0] ESTIMATE

Received, we'll look into it.

For the left join, how many rows does it return?

What do you mean, how many rows? I'm using DBeaver, which always adds a limit and returns 200 rows by default.

When you grab the execution plan, add a LIMIT 200; and send it again.

Can I send you the execution plan for a query on 800 million rows instead? I've noticed that any plain join is slow while any left join returns in milliseconds, so the underlying mechanism should be the same.

I have a rough idea of the cause, which is why I need the LIMIT 200 left join plan to help confirm it.

sql:
explain SELECT a.device_id, b.device_id
FROM test_db.app_log_new_003 AS a
JOIN test_db.app_log_new_003 AS b
ON a.device_id = b.device_id
LIMIT 0, 200;

Query plan:
PLAN FRAGMENT 0
OUTPUT EXPRS:1: device_id | 65: device_id
PARTITION: UNPARTITIONED

RESULT SINK

4:EXCHANGE
limit: 200

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 04
UNPARTITIONED

3:HASH JOIN
| join op: INNER JOIN (BUCKET_SHUFFLE)
| colocate: false, reason:
| equal join conjunct: 1: device_id = 65: device_id
| limit: 200
|
|----2:EXCHANGE
|
0:OlapScanNode
TABLE: app_log_new_003
PREAGGREGATION: ON
PREDICATES: 1: device_id IS NOT NULL
partitions=1/1
rollup: app_log_new_003
tabletRatio=80/80
tabletList=130377,130381,130385,130389,130393,130397,130401,130405,130409,130413 …
cardinality=1823994338
avgRowSize=1.03679
numNodes=0

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 02
BUCKET_SHUFFLE_HASH_PARTITIONED: 65: device_id

1:OlapScanNode
TABLE: app_log_new_003
PREAGGREGATION: ON
PREDICATES: 65: device_id IS NOT NULL
partitions=1/1
rollup: app_log_new_003
tabletRatio=80/80
tabletList=130377,130381,130385,130389,130393,130397,130401,130405,130409,130413 …
cardinality=1823994338
avgRowSize=1.03679
numNodes=0

========================================================================================================================================================
sql:
explain SELECT a.device_id, b.device_id
FROM test_db.app_log_new_003 AS a
LEFT JOIN test_db.app_log_new_003 AS b
ON a.device_id = b.device_id
LIMIT 0, 200;

Query plan:
PLAN FRAGMENT 0
OUTPUT EXPRS:1: device_id | 65: device_id
PARTITION: UNPARTITIONED

RESULT SINK

4:EXCHANGE
limit: 200

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 04
UNPARTITIONED

3:HASH JOIN
| join op: RIGHT OUTER JOIN (BUCKET_SHUFFLE)
| colocate: false, reason:
| equal join conjunct: 65: device_id = 1: device_id
| limit: 200
|
|----2:EXCHANGE
| limit: 200
|
0:OlapScanNode
TABLE: app_log_new_003
PREAGGREGATION: ON
partitions=1/1
rollup: app_log_new_003
tabletRatio=80/80
tabletList=130377,130381,130385,130389,130393,130397,130401,130405,130409,130413 …
cardinality=1823994338
avgRowSize=1.03679
numNodes=0

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 02
BUCKET_SHUFFLE_HASH_PARTITIONED: 1: device_id

1:OlapScanNode
TABLE: app_log_new_003
PREAGGREGATION: ON
partitions=1/1
rollup: app_log_new_003
tabletRatio=80/80
tabletList=130377,130381,130385,130389,130393,130397,130401,130405,130409,130413 …
cardinality=200
avgRowSize=1.03679
numNodes=0
limit: 200

Now I know the cause. For the left join, the join's hash table only needs to keep 200 entries: because it is a left outer join, those build rows are each guaranteed to produce an output row, so 200 build rows guarantee 200 result rows (which is why the plan pushes limit: 200 into the build-side exchange). The inner join is different: it has to store every value in the hash table, because any given row might not find a match.
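
The explanation above can be sketched as a toy hash join in Python (an illustrative single-process model, not StarRocks internals; the function names are made up). Note that the plan reflects this: the optimizer flipped the LEFT JOIN into a RIGHT OUTER JOIN so the preserved side becomes the build input, then capped that input at the limit.

```python
# Toy model of why LIMIT can be pushed into the hash-join build side for a
# left outer join but not an inner join. Illustrative only, not StarRocks code.

def inner_join(probe, build, limit):
    """Inner join: EVERY build row must enter the hash table, because a probe
    row may match none of them; truncating the build input could lose matches."""
    table = {}
    for key in build:                      # must consume the full build side
        table[key] = table.get(key, 0) + 1
    out = []
    for key in probe:
        out += [(key, key)] * table.get(key, 0)
        if len(out) >= limit:
            break
    return out[:limit]

def left_outer_join_limited(preserved, other, limit):
    """Left outer join with the preserved side as build input (the flip the
    optimizer made): each build row appears in the output at least once,
    matched or NULL-padded, so `limit` build rows guarantee `limit` results.
    (Duplicate matches on the probe side are ignored to keep the toy short.)"""
    build = preserved[:limit]              # hash table capped at `limit` rows
    other_keys = set(other)
    return [(k, k if k in other_keys else None) for k in build][:limit]

print(inner_join(["x", "z"], ["x", "x", "y"], 10))              # [('x', 'x'), ('x', 'x')]
print(left_outer_join_limited(["x", "y", "z", "w"], ["x"], 2))  # [('x', 'x'), ('y', None)]
```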