SELECT x FROM db.table Tablet(xxx) LIMIT 20 returns ERROR 1064 (HY000): Invalid tablet id

[Details] SELECT x FROM db.table Tablet(xxx) LIMIT 20 returns ERROR 1064 (HY000): Invalid tablet id: '10178'
[Background] Our daily statistics job fails. I extracted the SQL and found the error comes from SELECT x FROM db.table Tablet(xxx) LIMIT 20;
The table is loaded via Flink, roughly a few million rows per day. Based on code analysis, the problem appears to have been introduced by these two changes:
[Enhancement] reduce the loop time in building scanNode (backport #16236) (#16300)
[BugFix] fixup mistake of access tablet(tablet_id,…) (#16390)

The cause:
MaterializedViewRule.transform calls selectBestIndexes, which in turn calls selectBestRowCountIndex, and that method picks an index by row count.
If a rollup index happens to have a smaller row count than the base index, the rollup index is selected (this does not happen for every column queried; since I load with Flink, I had to try several columns before reproducing it). Execution then reaches the newly introduced OlapScanNode.mapTabletsToPartitions, and because the selected index is not the base index, the tablet ids given in the TABLET(...) hint cannot be found there and the error is thrown.
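
To make the failure condition concrete, here is a rough sketch of the check that seems to be missing (the class and method below are hypothetical and not the actual FE code): when a query carries an explicit TABLET(id, ...) hint, the hinted tablet ids belong to the base index, so either the MV/rollup index rewrite should be skipped or mapTabletsToPartitions should fall back to the base index.

import java.util.List;

public class TabletHintGuard {
    /**
     * Hypothetical guard: may the MV/rollup index rewrite be applied?
     *
     * @param hintedTabletIds tablet ids from the TABLET(...) hint, empty if no hint
     * @param selectedIndexId the index the optimizer wants to scan
     * @param baseIndexId     the table's base index id
     */
    public static boolean mvRewriteAllowed(List<Long> hintedTabletIds,
                                           long selectedIndexId,
                                           long baseIndexId) {
        // With an explicit tablet hint, only the base index contains the hinted
        // tablets; selecting a rollup index makes mapTabletsToPartitions unable to
        // find them ("Invalid tablet id" / NullPointerException).
        if (!hintedTabletIds.isEmpty()) {
            return selectedIndexId == baseIndexId;
        }
        return true;
    }
}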

How should this be fixed?

The table definition:
CREATE TABLE table (
id varchar(65533) NULL COMMENT "",
id2 varchar(65533) NULL COMMENT "",
partDate date NULL COMMENT "",
time1 decimal64(10, 0) NULL COMMENT "",
time2 decimal64(10, 0) NULL COMMENT "",
time3 decimal64(10, 0) NULL COMMENT "",
time4 varchar(65533) NULL COMMENT "",
time5 float SUM NULL DEFAULT "0" COMMENT "",
id2 bitmap BITMAP_UNION NULL COMMENT "",
id3 hll HLL_UNION NULL COMMENT "",
num1 bigint(20) SUM NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(id, id2, partDate, time1, time2, time3, time4)
COMMENT "OLAP"
PARTITION BY RANGE(partDate)
(
START ("2023-03-10") END ("2023-03-20") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(id, id2) BUCKETS 10
PROPERTIES (
"replication_num" = "3",
"bloom_filter_columns" = "id2, id",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-10",
"dynamic_partition.end" = "5",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10",
"dynamic_partition.replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

create materialized view mv1 as select id,id2,partDate,time4,time1,time2,time3,SUM(time5),HLL_UNION(id3),SUM(num1) from table group by id,id2,partDate,time4,time1,time2,time3;
create materialized view mv2 as select id,id2,partDate,time4,time1,time2,time3,HLL_UNION(id3) from table group by id,id2,partDate,time4,time1,time2,time3;
create materialized view mv3 as select id,id2,partDate from table group by id,id2,partDate;

The logs:
2023-03-12 07:48:22,696 WARN (AutoStatistic|35) [StmtExecutor.execute():557] execute Exception, sql INSERT INTO table_statistic_v1 SELECT
java.lang.NullPointerException: null
at com.starrocks.planner.OlapScanNode.mapTabletsToPartitions(OlapScanNode.java:718) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.planner.OlapScanNode.updateScanInfo(OlapScanNode.java:660) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalOlapScan(PlanFragmentBuilder.java:545) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalOlapScan(PlanFragmentBuilder.java:259) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.optimizer.operator.physical.PhysicalOlapScanOperator.accept(PhysicalOlapScanOperator.java:132) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:268) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDistribution(PlanFragmentBuilder.java:1371) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDistribution(PlanFragmentBuilder.java:259) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.optimizer.operator.physical.PhysicalDistributionOperator.accept(PhysicalDistributionOperator.java:44) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:268) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalLimit(PlanFragmentBuilder.java:2323) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalLimit(PlanFragmentBuilder.java:259) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.optimizer.operator.physical.PhysicalLimitOperator.accept(PhysicalLimitOperator.java:33) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:268) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalHashAggregate(PlanFragmentBuilder.java:1135) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalHashAggregate(PlanFragmentBuilder.java:259) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.optimizer.operator.physical.PhysicalHashAggregateOperator.accept(PhysicalHashAggregateOperator.java:149) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:268) ~[fe-core-2.3.0.jar:2.3.0]

And also:
2023-03-12 07:48:22,709 WARN (AutoStatistic|35) [StatisticAutoCollector.collectStatistics():133] Statistic collect work job: 15253, type: SAMPLE, db: default_cluster:db, table: table. throw exception.
com.starrocks.common.DdlException:
at com.starrocks.statistic.StatisticExecutor.collectStatisticSync(StatisticExecutor.java:246) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.statistic.StatisticExecutor.sampleCollectStatisticSync(StatisticExecutor.java:224) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.statistic.StatisticAutoCollector$TableCollectJob.tryCollect(StatisticAutoCollector.java:45) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.statistic.StatisticAutoCollector$TableCollectJob.access$100(StatisticAutoCollector.java:34) ~[fe-core-2.3.0.jar:2.3.0]
at com.starrocks.statistic.StatisticAutoCollector.collectStatistics(StatisticAutoCollector.java:129) [fe-core-2.3.0.jar:2.3.0]
at com.starrocks.statistic.StatisticAutoCollector.runAfterCatalogReady(StatisticAutoCollector.java:72) [fe-core-2.3.0.jar:2.3.0]
at com.starrocks.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:61) [fe-core-2.3.0.jar:2.3.0]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) [fe-core-2.3.0.jar:2.3.0]

[Business impact] CBO statistics collection
[StarRocks version] 2.3.10
[Cluster size] 3 FE + 3 BE (FE and BE co-located)
[Machine info] CPU vCores / memory / NIC: 8C / 32G / ?

Yes, this is a bug; we will fix it.

How should it be fixed?

Is there any progress?

The dev team hasn't scheduled it yet and the priority is fairly low. Would you have time to fix it for us?

For now I have rolled those changes back. I'm not very familiar with the CBO optimizer code and worry that a change there could affect other SQL, so I'll study this area when I have time.

Is there any progress? We reverted the code and the error is gone, but the result is just that the daily statistics job collects no data for this table, so it gets no CBO optimization.

Also, why can a rollup's row count be smaller than the base index's? I thought rollups were kept in sync synchronously.
I asked an AI myself, and it proposed the fix below. Is this correct?

The row count of a materialized view can be smaller than the base table's mainly because the materialized view's data has not been fully built yet and it is still in the "rollup state".

So the main ways to fix this would be:

1. Wait until the materialized view's data has finished building and its state becomes "ready" before running the query. At that point the materialized view's and base table's row counts will be consistent, and the optimizer will not mis-select the index.
2. When the optimizer selects an index, check the materialized view's state. If it is not "ready", do not consider that materialized view and use the base table directly. This avoids index mis-selection caused by a materialized view whose data is not fully built.
3. When building the OlapScanNode, likewise check the materialized view's state. If it is not "ready", do not use it and fall back to the base table.
4. Restrict MaterializedViewRule so it only applies when the materialized view's state is "ready". This guarantees the materialized view's data is consistent with the base table and avoids the mis-selection caused by a smaller row count.

So you are right, my earlier suggestion was not entirely correct. The key to fixing this is the materialized view's data update mechanism: make sure that whenever a materialized view is used, its data is consistent with the base table.
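
To make option 2 above concrete, here is a rough sketch of what that filter might look like (the types are simplified stand-ins, not the actual MaterializedViewRule code, which works against the FE's own index metadata classes): drop any candidate index whose rollup build has not finished before comparing row counts.

import java.util.ArrayList;
import java.util.List;

public class IndexCandidateFilter {

    // Simplified stand-in for the real index state enum in the FE.
    enum IndexState { NORMAL, ROLLUP_BUILDING }

    static class IndexCandidate {
        final long indexId;
        final IndexState state;
        final long rowCount;

        IndexCandidate(long indexId, IndexState state, long rowCount) {
            this.indexId = indexId;
            this.state = state;
            this.rowCount = rowCount;
        }
    }

    // Keep only fully built indexes, so an in-progress rollup whose row count is
    // temporarily smaller than the base index can never win the row-count choice.
    static List<IndexCandidate> readyCandidates(List<IndexCandidate> all) {
        List<IndexCandidate> ready = new ArrayList<>();
        for (IndexCandidate c : all) {
            if (c.state == IndexState.NORMAL) {
                ready.add(c);
            }
        }
        return ready;
    }
}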

I'll push the dev team again to get it scheduled.

OK. Please let me know once there's a conclusion.