Primary key table: historical data loaded via the Spark connector cannot be queried by the partition key

【Background】A primary key table whose historical data was loaded through the Spark connector. Queries filtering on the partition key columns return no data, while queries filtering on another time column do return data.
【Business impact】None
【StarRocks version】3.1.2, shared-data mode (separated storage and compute)
【Cluster size】1 FE + 6 BE (deployed independently)
【Machine spec】256 GB RAM / 128 cores, 10 GbE NIC
【Table model】Primary key model
【Load method】Spark connector (a generic write sketch is shown after this list)
【Contact】timedifier@126.com
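(The exact Spark job configuration was not included in the post. For reference only, a typical write through the StarRocks Spark connector, expressed in Spark SQL, looks roughly like the sketch below; the FE addresses, credentials, database name, and source table are placeholders rather than the actual job settings.)

-- Sketch: register the StarRocks table as a Spark SQL datasource table (all option values are placeholders)
CREATE TABLE spark_sink
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "fe_host:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://fe_host:9030",
  "starrocks.table.identifier" = "your_db.dwd_nwp_transaction_fact_invoicing_i",
  "starrocks.user" = "root",
  "starrocks.password" = ""
);

-- Sketch: write the historical rows; the partition a row lands in is determined by its
-- invoice_year / invoice_month values
INSERT INTO spark_sink SELECT * FROM source_history_table;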

Table DDL:
CREATE TABLE dwd_nwp_transaction_fact_invoicing_i (
invoice_info_id bigint(20) NOT NULL COMMENT "",
invoice_year int(11) NOT NULL COMMENT "invoice year",
invoice_month int(11) NOT NULL COMMENT "invoice month",
invoice_day int(11) NOT NULL COMMENT "invoice day",
invoice_date datetime NOT NULL COMMENT "invoice time",
-- ... remaining columns omitted from the post ...
) ENGINE=OLAP
PRIMARY KEY(invoice_info_id, invoice_year, invoice_month)
PARTITION BY (invoice_year, invoice_month)
DISTRIBUTED BY HASH(invoice_info_id)
PROPERTIES (
"replication_num" = "1",
"datacache.partition_duration" = "1 months",
"datacache.enable" = "true",
"storage_volume" = "builtin_storage_volume",
"enable_async_write_back" = "false",
"enable_persistent_index" = "false",
"compression" = "LZ4"
);

Returns no rows:
select * from dwd_nwp_transaction_fact_invoicing_i
where invoice_year=2023.0 and invoice_month=9.0 limit 10

Returns rows:
select * from dwd_nwp_transaction_fact_invoicing_i
where year(invoice_date)=2023 and month(invoice_date)=9 limit 10


Please post the output of explain costs plus the SQL; this looks like it could be a cast conversion problem.
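If the literals really were decimals (2023.0 / 9.0) against the INT partition columns, the EXPLAIN output would show whether a cast gets applied to invoice_year / invoice_month, and a cast on a partition column can prevent partition pruning. A quick way to compare, as a sketch:

explain costs
select count(1) from dwd_nwp_transaction_fact_invoicing_i
where invoice_year = 2023.0 and invoice_month = 9.0;

explain costs
select count(1) from dwd_nwp_transaction_fact_invoicing_i
where invoice_year = 2023 and invoice_month = 9;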

That was my mistake in the post; as the screenshot shows, integer literals were actually used. In the explain output, the first step (the scan) hits 0 partitions.

explain costs
SELECT
count(1)
FROM
dwd_nwp_transaction_fact_invoicing_i
WHERE
invoice_year = 2023
AND invoice_month = 9
LIMIT 10
------------------------------------------------ divider ------------------------------------------------

PLAN FRAGMENT 0(F01)
Output Exprs:101: count
Input Partition: UNPARTITIONED
RESULT SINK

3:EXCHANGE
distribution type: GATHER
limit: 10
cardinality: 1

PLAN FRAGMENT 1(F00)

Input Partition: RANDOM
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 03

2:AGGREGATE (update finalize)
| aggregate: count[(1); args: TINYINT; result: BIGINT; args nullable: false; result nullable: false]
| limit: 10
| cardinality: 1
| column statistics:
| * count->[0.0, 1.0, 0.0, 8.0, 1.0] ESTIMATE
|
1:Project
| output columns:
| 103 <-> 1
| cardinality: 1
| column statistics:
| * auto_fill_col->[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
|
0:OlapScanNode
table: dwd_nwp_transaction_fact_invoicing_i, rollup: dwd_nwp_transaction_fact_invoicing_i
preAggregation: on
Predicates: [2: invoice_year, INT, false] = 2023, [3: invoice_month, INT, false] = 9
partitionsRatio=0/118, tabletsRatio=0/0
tabletList=
actualRows=0, avgRowSize=9.0
cardinality: 1
column statistics:
* invoice_year->[2023.0, 2023.0, 0.0, 4.0, 10.0] ESTIMATE
* invoice_month->[9.0, 9.0, 0.0, 4.0, 12.0] ESTIMATE
* auto_fill_col->[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE

Please also post the explain costs for the query that does return data.

explain costs
select count(1) from dwd_nwp_transaction_fact_invoicing_i
where year(invoice_date)=2023 and month(invoice_date)=9 limit 10

------------------------------------------------ divider ------------------------------------------------
PLAN FRAGMENT 0(F01)
Output Exprs:101: count
Input Partition: UNPARTITIONED
RESULT SINK

4:AGGREGATE (merge finalize)
| aggregate: count[([101: count, BIGINT, false]); args: TINYINT; result: BIGINT; args nullable: true; result nullable: false]
| limit: 10
| cardinality: 1
| column statistics:
| * count->[0.0, 142684.44166666665, 0.0, 8.0, 1.0] ESTIMATE
|
3:EXCHANGE
distribution type: GATHER
cardinality: 1

PLAN FRAGMENT 1(F00)

Input Partition: RANDOM
OutPut Partition: UNPARTITIONED
OutPut Exchange Id: 03

2:AGGREGATE (update serialize)
| aggregate: count[(1); args: TINYINT; result: BIGINT; args nullable: false; result nullable: false]
| cardinality: 1
| column statistics:
| * count->[0.0, 142684.44166666665, 0.0, 8.0, 1.0] ESTIMATE
|
1:Project
| output columns:
| 103 <-> 1
| cardinality: 142684
| column statistics:
| * auto_fill_col->[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE
|
0:OlapScanNode
table: dwd_nwp_transaction_fact_invoicing_i, rollup: dwd_nwp_transaction_fact_invoicing_i
preAggregation: on
Predicates: year[([20: invoice_date, DATETIME, false]); args: DATETIME; result: SMALLINT; args nullable: false; result nullable: false] = 2023, month[([20: invoice_date, DATETIME, false]); args: DATETIME; result: TINYINT; args nullable: false; result nullable: false] = 9
partitionsRatio=117/118, tabletsRatio=315/315
tabletList=16213,16215,16216,16217,16218,16219,16220,16221,16222,16223 …
actualRows=17086741, avgRowSize=9.0
cardinality: 142684
column statistics:
* invoice_date->[1.388592E9, 1.69476294E9, 0.0, 8.0, 142684.44166666665] ESTIMATE
* auto_fill_col->[1.0, 1.0, 0.0, 1.0, 1.0] ESTIMATE

Please also paste the result of: show partitions from dwd_nwp_transaction_fact_invoicing_i;

Please check whether the List column in the show partitions output contains the value (2023, 9). Also, does the following query return any rows?
select * from dwd_nwp_transaction_fact_invoicing_i
where invoice_year=2022 and invoice_month=9 limit 10

The partitions show data (screenshot 1), but querying by the partition key returns nothing (screenshot 2).

Please run select * from dwd_nwp_transaction_fact_invoicing_i partitions (p2023_9); and check what data was actually written into that partition. I suspect the Spark connector may have written rows into the wrong partition.
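A grouped count like the sketch below would also show at a glance whether the rows stored in p2023_9 really carry (2023, 9) in the partition key columns:

select invoice_year, invoice_month, count(*) AS cnt
from dwd_nwp_transaction_fact_invoicing_i partitions (p2023_9)
group by invoice_year, invoice_month;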

invoice_year and invoice_month are both correct.

If it's convenient, could you add me on WeChat: 17306448687?

Has this issue been resolved? I use an int column as the partition key in the normal way, and the insert succeeds, but querying by the partition key also returns no data; explain shows the query does not hit any partition. Same problem as this user.

Are you also running in shared-data (separated storage and compute) mode? Could you paste your environment information and load settings? The user in the original post could not reproduce the issue in later tests.
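Concretely, something like the following would help (table and column names below are placeholders for your own schema):

show partitions from your_table;
explain costs select count(1) from your_table where your_partition_key = <value>;
-- plus the StarRocks version, the Spark connector version, and the write options used by the Spark job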