Servers restart automatically

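[Details] After upgrading from version 2.4.1 to 2.5.3 and starting to use the asynchronous materialized view feature, the three server nodes restart automatically at random. The servers are VMware virtual machines. The VMware management platform shows the error: "The CPU has been disabled by the guest operating system. Power off or reset the virtual machine."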


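[Business impact] Queries are very slow and occasionally fail entirely.
[StarRocks version] 2.5.3
[Cluster size] 3 FE (3 followers) + 3 BE (FE and BE co-located)
[Machine info] CPU virtual cores / memory / NIC: 8C / 16 GB / 10 GbE
[Contact]
[Attachments]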

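  • fe.log / be.INFO / relevant screenshots
  • Slow queries:
    • Parallelism: show variables like '%parallel_fragment_exec_instance_num%';
      image
    • Whether pipeline is enabled: show variables like '%pipeline%';
      image
    • CPU and memory usage screenshots for the FE/BE nodes, before and after the restart:
      Offline time:

      Heap memory:
      image
      QPS:
      image
      CPU:
      image
      BE memory:
      image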

fe.out log from around the time of the restart:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/data/server/fe/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/server/fe/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/server/fe/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
[2023-04-07 14:13:58] notify new FE type transfer: UNKNOWN
[2023-04-07 14:13:58] notify new FE type transfer: FOLLOWER
log4j:WARN No appenders could be found for logger (velocity).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
四月 07, 2023 2:14:48 下午 com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncBulkCompleter accept
警告: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)
Caused by: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at com.starrocks.sql.optimizer.task.SeriallyTaskScheduler.executeTasks(SeriallyTaskScheduler.java:38)
at com.starrocks.sql.optimizer.Optimizer.ruleRewriteIterative(Optimizer.java:479)
at com.starrocks.sql.optimizer.Optimizer.logicalRuleRewrite(Optimizer.java:220)
at com.starrocks.sql.optimizer.Optimizer.rewriteAndValidatePlan(Optimizer.java:324)
at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:132)
at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:93)
at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95)
at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66)
at com.starrocks.statistic.StatisticExecutor.executeDQL(StatisticExecutor.java:239)
at com.starrocks.statistic.StatisticExecutor.queryStatisticSync(StatisticExecutor.java:83)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.queryStatisticsData(ColumnBasicStatsCacheLoader.java:111)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.lambda$asyncLoadAll$1(ColumnBasicStatsCacheLoader.java:77)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 5 more

四月 07, 2023 2:14:48 下午 com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncBulkCompleter accept
警告: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)
Caused by: com.starrocks.sql.common.StarRocksPlannerException: StarRocks planner use long time 3000 ms in logical phase, This probably because 1. FE Full GC, 2. Hive external table fetch metadata took a long time, 3. The SQL is very complex. You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
at com.starrocks.sql.optimizer.task.SeriallyTaskScheduler.executeTasks(SeriallyTaskScheduler.java:38)
at com.starrocks.sql.optimizer.Optimizer.ruleRewriteIterative(Optimizer.java:479)
at com.starrocks.sql.optimizer.Optimizer.logicalRuleRewrite(Optimizer.java:220)
at com.starrocks.sql.optimizer.Optimizer.rewriteAndValidatePlan(Optimizer.java:324)
at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:132)
at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:93)
at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95)
at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66)
at com.starrocks.statistic.StatisticExecutor.executeDQL(StatisticExecutor.java:239)
at com.starrocks.statistic.StatisticExecutor.queryStatisticSync(StatisticExecutor.java:83)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.queryStatisticsData(ColumnBasicStatsCacheLoader.java:111)
at com.starrocks.sql.optimizer.statistics.ColumnBasicStatsCacheLoader.lambda$asyncLoadAll$1(ColumnBasicStatsCacheLoader.java:77)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 5 more

You could 1. adjust FE JVM config, 2. try query again, 3. enlarge new_planner_optimize_timeout session variable
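
As a minimal sketch of the third suggestion, the statements below inspect and raise new_planner_optimize_timeout for the current session. The variable name comes from the error message itself. The value 10000 is only an illustrative assumption, not a recommended setting, and the unit is taken to be milliseconds based on the "3000 ms" in the log.

```sql
-- Check the current planner timeout; the fe.out message above reports 3000 ms.
SHOW VARIABLES LIKE '%new_planner_optimize_timeout%';

-- Raise the timeout for the current session only.
-- 10000 is an illustrative value (assumed to be milliseconds), not a recommendation.
SET new_planner_optimize_timeout = 10000;

-- Confirm the new value, then retry the query that timed out.
SHOW VARIABLES LIKE '%new_planner_optimize_timeout%';
```

Note that the stack trace shows the timeout being hit inside an internal statistics query (StatisticExecutor), so a session-level change made from a client connection may not affect it. That would need to be verified on the cluster.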

Did the BE restart as well? Please check be.out.

FE and BE are co-located. be.out shows no abnormal logs.

Running on VMware with FE and BE co-located on top of that is not recommended for production environments.

This is not a production environment; it is a development environment.