join与union联合使用问题

【详述】join与union联合使用时数据增多
【StarRocks版本】2.0

sql如下:
SELECT *
FROM (
SELECT buyer_id user_id, 1 order_number, payment_amount order_amount
FROM ods_moses_order.ods_sub_order
WHERE created_time between UNIX_TIMESTAMP(DATE_FORMAT(date_format(‘20220516’, ‘%Y%m%d’),’%Y-%m-%d 00:00:00.000’)) * 1000 and UNIX_TIMESTAMP(DATE_FORMAT(date_format(‘20220516’, ‘%Y%m%d’),’%Y-%m-%d 23:59:59.999’)) * 1000
AND pay_status = 2
UNION ALL
SELECT user_id, 1 order_number, payment_amount order_amount
FROM ods_mcs_third_order.ods_sub_order
WHERE created_time between UNIX_TIMESTAMP(DATE_FORMAT(date_format(‘20220516’, ‘%Y%m%d’),’%Y-%m-%d 00:00:00.000’)) * 1000 and UNIX_TIMESTAMP(DATE_FORMAT(date_format(‘20220516’, ‘%Y%m%d’),’%Y-%m-%d 23:59:59.999’)) * 1000
AND pay_status = 2
) t_od JOIN dwd_yd_data.dwd_yd_shop_user_info t_su
ON t_od.user_id = t_su.user_id

其中,union的第一个查询有39条数据,第二个查有0条数据,union后也是39条数据,但是join了t_su后,数据变为了579行。dwd_yd_shop_user_info 表中按user_id分组发现没有重复数据。单独用dwd_yd_shop_user_info 与第一个子查询进行join也是39行是正确的。

explain costs sql 的结果请您补充下

PLAN FRAGMENT 0(F05)
Output Exprs:84: cast | 85: expr | 86: payment_amount | 123: user_id | 124: shop_id | 125: province_code | 126: province_name | 127: city_code | 128: city_name
Input Partition: UNPARTITIONED
RESULT SINK

13:EXCHANGE
cardinality: 702347

PLAN FRAGMENT 1(F00)

Input Partition: RANDOM
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 13

12:Project
| output columns:
| 84 <-> [84: cast, VARCHAR(65533), true]
| 85 <-> [85: expr, TINYINT, false]
| 86 <-> [86: payment_amount, INT, true]
| 123 <-> [123: user_id, LARGEINT, false]
| 124 <-> [124: shop_id, LARGEINT, true]
| 125 <-> [125: province_code, INT, true]
| 126 <-> [126: province_name, VARCHAR, true]
| 127 <-> [127: city_code, INT, true]
| 128 <-> [128: city_name, VARCHAR, true]
| cardinality: 702347
| column statistics:
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * user_id–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
| * shop_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
|
11:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| equal join conjunct: [130: cast, DOUBLE, true] = [129: cast, DOUBLE, true]
| build runtime filters:
| - filter_id = 0, build_expr = (129: cast), remote = false
| cardinality: 702347
| column statistics:
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * user_id–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
| * shop_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
|
|----10:EXCHANGE
| cardinality: 13217
|
1:Project
| output columns:
| 123 <-> [123: user_id, LARGEINT, false]
| 124 <-> [124: shop_id, LARGEINT, true]
| 125 <-> [125: province_code, INT, true]
| 126 <-> [126: province_name, VARCHAR, true]
| 127 <-> [127: city_code, INT, true]
| 128 <-> [128: city_name, VARCHAR, true]
| 130 <-> cast([123: user_id, LARGEINT, false] as DOUBLE)
| cardinality: 702347
| column statistics:
| * user_id–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
| * shop_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * province_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * city_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
|
0:OlapScanNode
table: dwd_yd_shop_user_info, rollup: dwd_yd_shop_user_info
preAggregation: on
partitionsRatio=1/1, tabletsRatio=6/6
tabletList=72611,72615,72619,72623,72627,72631
actualRows=702347, avgRowSize=37.0
cardinality: 702347
probe runtime filters:
- filter_id = 0, probe_expr = (CAST(123: user_id AS DOUBLE))
column statistics:
* user_id–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE
* shop_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* province_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* province_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* city_code–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* city_name–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* cast–>[9.7859750496651674E17, 1.5381502503702487E18, 0.0, 16.0, 583205.0] ESTIMATE

PLAN FRAGMENT 2(F01)

Input Partition: RANDOM
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 10

9:Project
| output columns:
| 84 <-> [84: cast, VARCHAR(65533), true]
| 85 <-> [85: expr, TINYINT, false]
| 86 <-> [86: payment_amount, INT, true]
| 129 <-> cast([84: cast, VARCHAR(65533), true] as DOUBLE)
| cardinality: 13217
| column statistics:
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
|
2:UNION
| child exprs:
| [82, VARCHAR(65533), true] | [83, TINYINT, false] | [11, INT, true]
| [121, VARCHAR, true] | [122, TINYINT, false] | [95, INT, true]
| pass-through-operands: all
| cardinality: 13217
| column statistics:
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
|
|----8:EXCHANGE
| cardinality: 360
|
5:EXCHANGE
cardinality: 12857

PLAN FRAGMENT 3(F03)

Input Partition: RANDOM
OutPut Partition: RANDOM
OutPut Exchange Id: 08

7:Project
| output columns:
| 95 <-> [95: payment_amount, INT, true]
| 121 <-> [121: user_id, VARCHAR, true]
| 122 <-> 1
| cardinality: 360
| column statistics:
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * user_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
|
6:OlapScanNode
table: ods_sub_order, rollup: ods_sub_order
preAggregation: off. Reason: Has can not pre-aggregation Join
Predicates: [112: created_time, LARGEINT, true] >= cast(cast(unix_timestamp[(cast(‘2022-05-16 00:00:00.000’ as DATETIME)); args: DATETIME; result: INT; args nullable: true; result nullable: true] as BIGINT) * 1000 as LARGEINT), [112: created_time, LARGEINT, true] <= cast(cast(unix_timestamp[(cast(‘2022-05-16 23:59:59.999’ as DATETIME)); args: DATETIME; result: INT; args nullable: true; result nullable: true] as BIGINT) * 1000 as LARGEINT), [100: pay_status, TINYINT, false] = 2
partitionsRatio=1/1, tabletsRatio=1/1
tabletList=11949
actualRows=2879, avgRowSize=5.0
cardinality: 360
column statistics:
* payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* pay_status–>[2.0, 2.0, 0.0, 1.0, 1.0] UNKNOWN
* created_time–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* user_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE

PLAN FRAGMENT 4(F02)

Input Partition: RANDOM
OutPut Partition: RANDOM
OutPut Exchange Id: 05

4:Project
| output columns:
| 11 <-> [11: payment_amount, INT, true]
| 82 <-> cast([2: buyer_id, LARGEINT, true] as VARCHAR(65533))
| 83 <-> 1
| cardinality: 12857
| column statistics:
| * payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
| * expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
|
3:OlapScanNode
table: ods_sub_order, rollup: ods_sub_order
preAggregation: on
Predicates: [37: created_time, LARGEINT, true] >= cast(cast(unix_timestamp[(cast(‘2022-05-16 00:00:00.000’ as DATETIME)); args: DATETIME; result: INT; args nullable: true; result nullable: true] as BIGINT) * 1000 as LARGEINT), [37: created_time, LARGEINT, true] <= cast(cast(unix_timestamp[(cast(‘2022-05-16 23:59:59.999’ as DATETIME)); args: DATETIME; result: INT; args nullable: true; result nullable: true] as BIGINT) * 1000 as LARGEINT), [21: pay_status, TINYINT, false] = 2
partitionsRatio=1/1, tabletsRatio=1/1
tabletList=11764
actualRows=102855, avgRowSize=6.0
cardinality: 12857
column statistics:
* buyer_id–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* payment_amount–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* pay_status–>[2.0, 2.0, 0.0, 1.0, 1.0] UNKNOWN
* created_time–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* cast–>[-Infinity, Infinity, 0.0, 1.0, 1.0] UNKNOWN
* expr–>[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE

已经在下文补充,烦请您指导一下。

profile 请您也发一下,定位下是否是已经修复的bug

抱歉,profile 是指什么?
版本的话用的是2.0.0-GA。

profile.txt (44.7 KB)

dwd_yd_shop_user_info这个表和第二个子查询join结果对不对,可能是join的类型转成了double,结果不稳定。或者join条件改成这个试试:ON cast(t_od.user_id as string) = cast(t_su.user_id as string)

是的是的,使用ON cast(t_od.user_id as string) = cast(t_su.user_id as string)这种方式解决了问题。具体原因我再研究一下。
十分感谢大佬的指导!

这个版本不是release版本,您可以升级到2.0最新的版本,2.0作为一个LTS稳定版本会修复一些已知问题

好的,我们尽快升级。