StarRocks planner use long time 3000 ms

【Details】
When executing the following query,

SELECT
    date_id,
    COALESCE(SUM(value), 0) AS summary
FROM
    fact_tbl
WHERE
    project_id IN %(project_ids)s AND
    date_id >= 20220505 AND
    date_id <= 20220510
GROUP BY
    date_id;

The parameter project_ids allows the caller to pass in multiple integer values.

  • When 100 ID values are passed in, the query runs normally
  • But with one more ID, i.e. 101 values, the following error occurs
ERROR 1064 (HY000): StarRocks planner use long time 3000 ms, This probably because

1. FE Full GC,
2. Hive external table fetch metadata took a long time,
3. The SQL is very complex.

You could

1. adjust FE JVM config,
2. try query again,
3. enlarge new_planner_optimize_timeout session variable
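
The 3000 ms here presumably corresponds to the new_planner_optimize_timeout session variable mentioned in point 3; its current value can be checked with something like:

SHOW VARIABLES LIKE 'new_planner_optimize_timeout';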

【Background】
The table schema is as follows

CREATE TABLE `fact_tbl` (
    `version_id` varchar(128) NOT NULL,
    `item_id` varchar(128) NOT NULL,
    `date_id` int NOT NULL,
    `insertions` int NOT NULL,
    `project_id` int NOT NULL
) ENGINE=OLAP
DUPLICATE KEY (
    `version_id`,
    `item_id`
)
COMMENT "test fact table"
PARTITION BY RANGE (`project_id`) (
    ...  -- around 5000 partitions
)
DISTRIBUTED BY HASH (
    `version_id`
) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "in_memory" = "false",
    "storage_format" = "DEFAULT"
);

The table's data volume is as follows

  • size: 2.422 GB
  • ReplicaCount: 36008
  • rows: 14,007,272

The FE JVM configuration is as follows

JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$STARROCKS_HOME/log/fe.gc.log.$DATE"

# For jdk 9+, this JAVA_OPTS will be used as default JVM options
JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$STARROCKS_HOME/log/fe.gc.log.$DATE:time"

【StarRocks version】2.2.5

What StarRocks version? How many FEs and BEs? Are FE and BE co-located? Please provide this basic information first. Is the error intermittent, or does your testing show it always fails once the count reaches 101? How much memory does the machine have? Run grep -i full log/fe.gc* to check whether a Full GC has occurred.

@JiangLai
【StarRocks version】2.2.5
【Cluster size】1 FE + 1 BE (FE and BE co-located)
【Machine info】32 GB memory, 8 CPU cores
【Error behavior】
The error is not intermittent: it occurs whenever the number of declared IDs reaches 101. With 100 or fewer IDs it never occurs, and the query takes about 300 ms.

Checking log/fe.gc* returned nothing.

Got it. Could you share the CREATE TABLE statement for fact_tbl and the output of EXPLAIN COSTS + the query SQL?
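
For example, something along these lines (the IN list below is a placeholder for the actual 101 IDs):

EXPLAIN COSTS
SELECT
    date_id,
    COALESCE(SUM(value), 0) AS summary
FROM
    fact_tbl
WHERE
    project_id IN (1, 2, 3) AND  -- substitute the real ID list here
    date_id >= 20220505 AND
    date_id <= 20220510
GROUP BY
    date_id;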

@JiangLai
Sure.

  1. The CREATE TABLE statement for fact_tbl is in the original post
CREATE TABLE `fact_tbl` (
    `version_id` varchar(128) NOT NULL,
    `item_id` varchar(128) NOT NULL,
    `date_id` int NOT NULL,
    `insertions` int NOT NULL,
    `project_id` int NOT NULL
) ENGINE=OLAP
DUPLICATE KEY (
    `version_id`,
    `item_id`
)
COMMENT "test fact table"
PARTITION BY RANGE (`project_id`) (
    PARTITION p1 VALUES [("1"), ("2")),
    ...
    PARTITION p4556 VALUES [("4556"), ("4557"))  -- around 5000 partitions
)
DISTRIBUTED BY HASH (
    `version_id`
) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "in_memory" = "false",
    "storage_format" = "DEFAULT"
);
  2. Running EXPLAIN COSTS on the query throws the error described in the original post, so the execution plan cannot be obtained.
ERROR 1064 (HY000): StarRocks planner use long time 3000 ms, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable

Could you increase this parameter and run it again? new_planner_optimize_timeout
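
For example (the value is in milliseconds, so 10000 gives the planner a 10-second budget):

-- session-level change; takes effect for subsequent queries in this session
SET new_planner_optimize_timeout = 10000;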

@JiangLai
Sure. I raised the parameter to 10000 (10 s); the results of running EXPLAIN are as follows:

The execution plan shows that when the number of IDs goes from 100 to 101, the scan node's scope changes dramatically

  • With 100 IDs
partitionsRatio=99/4501, tabletsRatio=792/792
tabletList=41482,41484,41486,41488,41490,41492,41494,41496,32268,32270 ...
actualRows=698769, avgRowSize=4.0
cardinality: 87346
  • With 101 IDs
partitionsRatio=4501/4501, tabletsRatio=36008/36008
tabletList=23955,23957,23959,23961,23963,23965,23967,23969,23972,23974 ...
actualRows=14007272, avgRowSize=4.0
cardinality: 1750909
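
For reference, the 4501 denominator in partitionsRatio can be cross-checked against the table's partition list with something like:

-- each returned row is one range partition of fact_tbl; the row count
-- should match the 4501 reported in the plan
SHOW PARTITIONS FROM fact_tbl;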

With 101 IDs there is no partition or bucket pruning at all, which is effectively a full table scan. I just looked at your CREATE TABLE statement more closely: why did you create 5000 partitions... You said the data is only a bit over 2 GB, so building the table this way is very unreasonable. Partitions are generally recommended to be based on date; their main purpose is to make data management easier. As for why 100 is the exact cutoff that produces this result, we would still need to look at how the corresponding partitions are defined.

@JiangLai
Thanks! Here is the full picture of what I am running into now:

  1. Partitioning into 5000 numeric partitions was a table-design choice made during experiments. As you pointed out, that many partitions is indeed unreasonable, so the table actually used going forward is partitioned by date, with about 60 partitions (using dynamic partitioning);
  2. With the 5000 numeric partitions, the problem is not tied to querying the partition key value 101 specifically, because even when I make the queried ID values random, I still see 100 IDs working fine and 101 IDs triggering a full table scan. If that is the case, doesn't it mean this has nothing to do with "how the partitions are defined" that you mentioned?
  3. Judging from the execution plan, the extra time should be spent on the BE actually scanning data because of the full table scan. So why does the symptom show up as increased time in the FE planner, and why does going from 100 IDs to 101 IDs, just one more, cause such a step change? I don't quite understand this part :thinking:
  4. Finally, I have now also run into the "StarRocks planner use long time 3000 ms" error on a table partitioned by date. The CREATE TABLE and query statements are as follows:
  • CREATE TABLE statement
CREATE TABLE IF NOT EXISTS `fact_beta_tbl` (
    `version_id`           VARCHAR(128) NOT NULL,
    `_ptn_dt`              DATE NOT NULL COMMENT "Private field as data partition key",
    `item_id`              VARCHAR(128) NOT NULL COMMENT "Identifier of record among a version",
    `record_id`            VARCHAR(128) NOT NULL COMMENT "Another dimension field",
    ...  -- there are around 30 other columns
    `iid`                  INT NOT NULL DEFAULT '1' COMMENT "Identifier of records among a version as sort key"
) ENGINE=OLAP
DUPLICATE KEY (
    `version_id`
)
PARTITION BY RANGE (`_ptn_dt`) (
    START ("@ptn_init_start") END ("@ptn_init_end") EVERY (INTERVAL 1 day)  -- here are variables, there would be around 60 partitions
)
DISTRIBUTED BY HASH (
    `version_id`
) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "in_memory" = "false",
    "storage_format" = "DEFAULT",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-60",
    "dynamic_partition.end" = "5",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "8"
);
  • Query statement
SELECT
    fact_beta_tbl.iid AS iid,
    fact_beta_tbl.item_id AS item_id,
    ...,
    fact_beta_tbl.earliest_record_id AS earliest_record_id,     -- there are three record_id fields of this kind, with different values; each can be joined to the record_id field in fact_alpha_tbl
    earliest_tbl.author_time AS earliest_author_time,
    fact_beta_tbl.frequent_record_id AS frequent_record_id,
    frequent_tbl.author_time AS frequent_author_time,
    fact_beta_tbl.latest_record_id AS latest_record_id,
    latest_tbl.author_time AS latest_author_time
FROM
    fact_beta_tbl AS fact_beta_tbl
LEFT JOIN
    fact_alpha_tbl AS earliest_tbl
ON
    earliest_tbl.version_id = fact_beta_tbl.version_id AND
    earliest_tbl.record_id = fact_beta_tbl.earliest_record_id
LEFT JOIN
    fact_alpha_tbl AS frequent_tbl
ON
    frequent_tbl.version_id = fact_beta_tbl.version_id AND
    frequent_tbl.record_id = fact_beta_tbl.frequent_record_id
LEFT JOIN
    fact_alpha_tbl AS latest_tbl
ON
    latest_tbl.version_id = fact_beta_tbl.version_id AND
    latest_tbl.record_id = fact_beta_tbl.latest_record_id
WHERE
    fact_beta_tbl.item_id IS NOT NULL AND
    fact_beta_tbl.version_id = %(uuid_value)s
ORDER BY
    fact_beta_tbl.iid ASC
LIMIT 500
OFFSET 1000

The timeout for this query cannot be reproduced reliably. Its execution plan is attached (the run that produced this plan completed quickly): join_execution_plan.log (168.7 KB)

The sizes of the two joined tables are as follows

  • fact_beta_tbl
    • row_count: 307,893,150
    • disk_usage: 46GB
    • ReplicaCount: 512
  • fact_alpha_tbl (the table being joined to)
    • row_count: 35,292,058
    • disk_usage: 4.9GB
    • ReplicaCount: 512

@JiangLai Hello, regarding the timeout in my latest reply that cannot be reproduced reliably, do you have any suggestions on how to troubleshoot it?