FE minor-version upgrade failure: 2.5.0 > 2.5.13

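To help us locate your problem faster, please provide the following information. Thank you.
[Details] After upgrading FE from 2.5.0 to 2.5.13, FE reports errors on startup
[Background] FE upgrade 2.5.0 > 2.5.13
[Business impact]
[StarRocks version] e.g. 2.5.13
[Cluster size] e.g. 1 FE (2.5.0) + 4 BE (2.5.13)
[Machine info] vCPUs/memory/NIC, e.g. 24C/48G/10GbE
[Contact] WeChat group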
Oct 10, 2023 7:35:31 PM com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncBulkCompleter accept
WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: com.starrocks.sql.analyzer.SemanticException: Statistics query fail | Error Message [INTERNAL_ERROR] | QueryId [1e9a47fe-6761-11ee-86da-00505690ccaa] | SQL [SELECT cast(1 as INT), now(), db_id, table_id, column_name, sum(row_count), cast(sum(data_size) as bigint), hll_union_agg(ndv), sum(null_count), cast(max(cast(max as varchar(65533))) as string), cast(min(cast(min as varchar(65533))) as string) FROM column_statistics WHERE table_id = 1336920 and column_name = "req_id" GROUP BY db_id, table_id, column_name UNION ALL SELECT cast(1 as INT), now(), db_id, table_id, column_name, sum(row_count), cast(sum(data_size) as bigint), hll_union_agg(ndv), sum(null_count), cast(max(cast(max as float)) as string), cast(min(cast(min as float)) as string) FROM column_statistics WHERE table_id = 1336920 and column_name = "req_qty" GROUP BY db_id, table_id, column_name UNION ALL SELECT cast(1 as INT), now(), db_id, table_id, column_name, sum(row_count), cast(sum(data_size) as bigint), hll_union_agg(ndv), sum(null_count), cast(max(cast(max as varchar(65533))) as string), cast(min(cast(min as varchar(65533))) as string) FROM column_statistics WHERE table_id = 1336920 and column_name = "supply_type" GROUP BY db_id, table_id, column_name UNION ALL SELECT cast(1 as INT), now(), db_id, table_id, column_name, sum(row_count), cast(sum(data_size) as bigint), hll_union_agg(ndv), sum(null_count), cast(max(cast(max as varchar(65533))) as string), cast(min(cast(min as varchar(65533))) as string) FROM column_statistics WHERE table_id = 1336920 and column_name = "supply_id" GROUP BY db_id, table_id, column_name UNION ALL SELECT cast(1 as INT), now(), db_id, table_id, column_name, sum(row_count), cast(sum(data_size) as bigint), hll_union_agg(ndv), sum(null_count), cast(max(cast(max as float)) as string), cast(min(cast(min

Startup error after rolling back to 2.5.0:
Oct 11, 2023 9:19:23 AM com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncBulkCompleter accept
WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at com.starrocks.sql.optimizer.task.SeriallyTaskScheduler.executeTasks(SeriallyTaskScheduler.java:37)
at com.starrocks.sql.optimizer.Optimizer.ruleRewriteIterative(Optimizer.java:460)
at com.starrocks.sql.optimizer.Optimizer.logicalRuleRewrite(Optimizer.java:223)
at com.starrocks.sql.optimizer.Optimizer.rewriteAndValidatePlan(Optimizer.java:314)
at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:131)
at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:92)
at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95)
at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66)
at com.starrocks.statistic.StatisticExecutor.executeDQL(StatisticExecutor.java:239)
at com.starrocks.statistic.StatisticExecutor.queryStatisticSync(StatisticExecutor.java:83)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.queryStatisticsData(ColumnBasicStatsCacheLoader.java:111)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.lambda$asyncLoadAll$1(ColumnBasicStatsCacheLoader.java:77)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 5 more

2.5.0-fe.log.tgz (41.1 MB)

2.5.13-fe.log.tgz (56.6 MB)

Was the upgrade order BE first, then the follower FEs, then the leader FE? Please also post your fe.conf.

The BEs were upgraded first, then the FE. There is no follower.
fe.conf is identical for 2.5.0 and 2.5.13.
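The order asked about above (BE nodes first, then follower FEs, then the leader FE last) can be written down as a small planning helper. This is only a sketch: the role labels `be`, `fe_follower`, `fe_leader` and the node names are hypothetical, and nothing here talks to a real cluster.

```python
from typing import List, Tuple

# Hypothetical role labels; lower rank is upgraded earlier.
ROLE_RANK = {"be": 0, "fe_follower": 1, "fe_leader": 2}


def upgrade_order(nodes: List[Tuple[str, str]]) -> List[str]:
    """Return node names in upgrade order: BEs, then follower FEs, then the leader FE.

    `nodes` is a list of (name, role) pairs. sorted() is stable, so nodes
    with the same role keep the order in which they were listed.
    """
    ranked = sorted(nodes, key=lambda node: ROLE_RANK[node[1]])
    return [name for name, _role in ranked]
```

For the cluster in this thread (one FE, four BEs, no follower) the plan is simply the four BEs followed by the single FE.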
fe.conf (3.1 KB)

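After rolling the FE back to 2.5.0 and raising Xmx to 40 GB, the FE now starts, but performance is extremely poor.
I also found that no new image has been generated for many days:
-rw-r--r--. 1 root root 822255288 Sep 14 19:15 image.113852772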
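A stale image like the one listed above (dated Sep 14, while the errors are from Oct 10-11) can be spotted with a small script. The sketch below assumes only that image files are named `image.<number>`, as in the listing; the directory path you pass in, and any alert threshold you apply to the result, are your own choices and not something StarRocks defines.

```python
import re
import time
from pathlib import Path
from typing import Optional, Tuple

# Matches names such as "image.113852772"; other files in the directory are ignored.
IMAGE_NAME = re.compile(r"^image\.(\d+)$")


def newest_image(image_dir: str) -> Optional[Tuple[int, float]]:
    """Return (journal_id, mtime) of the image with the highest id, or None."""
    directory = Path(image_dir)
    if not directory.is_dir():
        return None
    best: Optional[Tuple[int, float]] = None
    for path in directory.iterdir():
        match = IMAGE_NAME.match(path.name)
        if match is None:
            continue
        candidate = (int(match.group(1)), path.stat().st_mtime)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


def image_age_days(image_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in days of the newest image file, or None if there is none."""
    latest = newest_image(image_dir)
    if latest is None:
        return None
    current = time.time() if now is None else now
    return (current - latest[1]) / 86400.0
```

Calling `image_age_days` on the FE's image directory and comparing the result with whatever threshold you consider normal gives an early warning that checkpoints have stopped.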

The issue was investigated by 小鳄 from the StarRocks team and is now fully resolved; the cluster has been upgraded to 2.5.13. :smiley:

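No checkpoint had been done for a long time, so replaying the journal was very slow; the slow replay may in turn have been caused by swap being enabled on the operating system. As for why no checkpoint was done, the logs had already been purged, so the cause cannot be determined; it may have been a problem in the older FE version.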
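Since swap is named above as a possible cause of the slow replay, here is a minimal sketch for checking whether swap is configured and in use on a Linux host. It only parses text in the `/proc/meminfo` format (`SwapTotal` and `SwapFree`, in kB); the function names are illustrative and not part of any StarRocks tooling.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <integer> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_enabled(meminfo_text: str) -> bool:
    """True if the host has any swap configured (SwapTotal > 0)."""
    return parse_meminfo(meminfo_text).get("SwapTotal", 0) > 0


def swap_used_kb(meminfo_text: str) -> int:
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    info = parse_meminfo(meminfo_text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)
```

On the FE host, pass in the contents of `/proc/meminfo`, for example `swap_used_kb(open("/proc/meminfo").read())`. A nonzero result while the FE is replaying its journal would be consistent with the explanation above, though it does not prove it.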