FE frequently OOMs

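[Details] The FE frequently runs out of memory (OOM). The data volume is small, only ten-odd GB. There are two BE nodes (4C/32G and 8C/32G) and one FE node (4C/32G, sharing a machine with one of the BEs).
[Background] There are quite a few ROUTINE LOAD jobs and some very complex materialized views. Everything runs normally for about a day, and then the FE OOMs.
[Business impact] Yes
[Shared-data (storage-compute separation)] No
[StarRocks version] 3.6.2
[Cluster size] 1 FE + 2 BE (FE and BE co-located)
[Machine info] vCPU cores / memory / NIC, e.g. 48C/64G/10GbE
[Attachments]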
fe.log
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "colocate group clone checker"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "export_exporting_job_scheduler_thread_pool-0"
Exception in thread "starrocks-mysql-nio I/O-3" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Timer-2" java.lang.OutOfMemoryError: Java heap space
Exception in thread "thrift-server-pool-38818" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-55-thread-1" java.lang.OutOfMemoryError: Java heap space
2024-10-09 07:06:11,038 leaderCheckpointer ERROR An exception occurred processing Appender SysWF org.apache.logging.log4j.core.appender.AppenderLoggingException: java.lang.OutOfMemoryError: Java heap space
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:165)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:134)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:125)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:89)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:683)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:641)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:624)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:560)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2040)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1907)
at org.apache.logging.log4j.spi.AbstractLogger.warn(AbstractLogger.java:2789)
at com.starrocks.persist.metablock.SRMetaBlockReader.close(SRMetaBlockReader.java:113)
at com.starrocks.server.GlobalStateMgr.loadImage(GlobalStateMgr.java:1651)
at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:207)
at com.starrocks.leader.Checkpoint.createImage(Checkpoint.java:193)
at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:108)
at com.starrocks.common.util.FrontendDaemon.runOneCycle(FrontendDaemon.java:72)
at com.starrocks.common.util.Daemon.run(Daemon.java:107)
Caused by: java.lang.OutOfMemoryError: Java heap space

Exception in thread "starrocks-taskrun-pool-14" java.lang.OutOfMemoryError: Java heap space
[2024-10-09 07:06:51] failed to begin txn after retried 3 times! db = CloseSafeDatabase{db=182241448}

gc.log
[2024-10-09T07:06:51.200+0800] GC(17102) Metaspace: 107317K(107840K)->107317K(107840K) NonClass: 97284K(97536K)->97284K(97536K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:51.200+0800] GC(17102) Pause Young (Normal) (G1 Evacuation Pause) 8048M->8048M(8192M) 2.811ms
[2024-10-09T07:06:51.200+0800] GC(17102) User=0.01s Sys=0.00s Real=0.00s
[2024-10-09T07:06:51.200+0800] Attempting full compaction
[2024-10-09T07:06:51.200+0800] GC(17103) Using 4 workers of 4 for full compaction
[2024-10-09T07:06:51.200+0800] GC(17103) Pause Full (G1 Compaction Pause)
[2024-10-09T07:06:51.200+0800] GC(17103) Phase 1: Mark live objects
[2024-10-09T07:06:51.733+0800] GC(17103) Phase 1: Mark live objects 532.651ms
[2024-10-09T07:06:51.733+0800] GC(17103) Phase 2: Prepare for compaction
[2024-10-09T07:06:51.739+0800] GC(17103) Phase 2: Prepare for compaction 5.921ms
[2024-10-09T07:06:51.739+0800] GC(17103) Phase 3: Adjust pointers
[2024-10-09T07:06:51.787+0800] GC(17103) Phase 3: Adjust pointers 48.309ms
[2024-10-09T07:06:51.787+0800] GC(17103) Phase 4: Compact heap
[2024-10-09T07:06:51.788+0800] GC(17103) Phase 4: Compact heap 0.657ms
[2024-10-09T07:06:51.821+0800] GC(17103) Eden regions: 0->0(102)
[2024-10-09T07:06:51.821+0800] GC(17103) Survivor regions: 0->0(0)
[2024-10-09T07:06:51.821+0800] GC(17103) Old regions: 2010->2009
[2024-10-09T07:06:51.821+0800] GC(17103) Archive regions: 2->2
[2024-10-09T07:06:51.821+0800] GC(17103) Humongous regions: 36->36
[2024-10-09T07:06:51.821+0800] GC(17103) Metaspace: 107317K(107840K)->107317K(107840K) NonClass: 97284K(97536K)->97284K(97536K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:51.821+0800] GC(17103) Pause Full (G1 Compaction Pause) 8048M->8045M(8192M) 620.808ms
[2024-10-09T07:06:51.821+0800] GC(17103) User=0.78s Sys=0.00s Real=0.62s
[2024-10-09T07:06:51.824+0800] GC(17104) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[2024-10-09T07:06:51.824+0800] GC(17104) Using 4 workers of 4 for evacuation
[2024-10-09T07:06:51.827+0800] GC(17104) To-space exhausted
[2024-10-09T07:06:51.827+0800] GC(17104) Pre Evacuate Collection Set: 0.4ms
[2024-10-09T07:06:51.827+0800] GC(17104) Merge Heap Roots: 0.0ms
[2024-10-09T07:06:51.827+0800] GC(17104) Evacuate Collection Set: 1.3ms
[2024-10-09T07:06:51.827+0800] GC(17104) Post Evacuate Collection Set: 1.4ms
[2024-10-09T07:06:51.827+0800] GC(17104) Other: 0.2ms
[2024-10-09T07:06:51.827+0800] GC(17104) Eden regions: 1->0(102)
[2024-10-09T07:06:51.827+0800] GC(17104) Survivor regions: 0->0(0)
[2024-10-09T07:06:51.827+0800] GC(17104) Old regions: 2009->2010
[2024-10-09T07:06:51.827+0800] GC(17104) Archive regions: 2->2
[2024-10-09T07:06:51.827+0800] GC(17104) Humongous regions: 36->36
[2024-10-09T07:06:51.827+0800] GC(17104) Metaspace: 107320K(107904K)->107320K(107904K) NonClass: 97286K(97600K)->97286K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:51.827+0800] GC(17104) Pause Young (Concurrent Start) (G1 Evacuation Pause) 8049M->8049M(8192M) 3.345ms
[2024-10-09T07:06:51.827+0800] GC(17104) User=0.01s Sys=0.00s Real=0.01s
[2024-10-09T07:06:51.827+0800] Attempting full compaction
[2024-10-09T07:06:51.827+0800] GC(17105) Using 4 workers of 4 for full compaction
[2024-10-09T07:06:51.827+0800] GC(17105) Pause Full (G1 Compaction Pause)
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Mark Cycle
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Clear Claimed Marks
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Clear Claimed Marks 0.035ms
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Scan Root Regions
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Scan Root Regions 0.003ms
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Mark
[2024-10-09T07:06:51.827+0800] GC(17106) Concurrent Mark From Roots
[2024-10-09T07:06:51.827+0800] GC(17106) Using 1 workers of 1 for marking
[2024-10-09T07:06:51.832+0800] GC(17105) Phase 1: Mark live objects
[2024-10-09T07:06:52.343+0800] GC(17105) Phase 1: Mark live objects 510.572ms
[2024-10-09T07:06:52.343+0800] GC(17105) Phase 2: Prepare for compaction
[2024-10-09T07:06:52.348+0800] GC(17105) Phase 2: Prepare for compaction 5.355ms
[2024-10-09T07:06:52.348+0800] GC(17105) Phase 3: Adjust pointers
[2024-10-09T07:06:52.396+0800] GC(17105) Phase 3: Adjust pointers 47.327ms
[2024-10-09T07:06:52.396+0800] GC(17105) Phase 4: Compact heap
[2024-10-09T07:06:52.397+0800] GC(17105) Phase 4: Compact heap 1.801ms
[2024-10-09T07:06:52.747+0800] GC(17105) Eden regions: 0->0(102)
[2024-10-09T07:06:52.747+0800] GC(17105) Survivor regions: 0->0(0)
[2024-10-09T07:06:52.747+0800] GC(17105) Old regions: 2010->2009
[2024-10-09T07:06:52.747+0800] GC(17105) Archive regions: 2->2
[2024-10-09T07:06:52.747+0800] GC(17105) Humongous regions: 36->36
[2024-10-09T07:06:52.747+0800] GC(17105) Metaspace: 107320K(107904K)->107320K(107904K) NonClass: 97286K(97600K)->97286K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:52.747+0800] GC(17105) Pause Full (G1 Compaction Pause) 8049M->8045M(8192M) 919.954ms
[2024-10-09T07:06:52.747+0800] GC(17105) User=2.29s Sys=0.00s Real=0.92s
[2024-10-09T07:06:52.749+0800] GC(17107) Pause Young (Normal) (G1 Evacuation Pause)
[2024-10-09T07:06:52.749+0800] GC(17107) Using 4 workers of 4 for evacuation
[2024-10-09T07:06:52.752+0800] GC(17107) To-space exhausted
[2024-10-09T07:06:52.752+0800] GC(17107) Pre Evacuate Collection Set: 0.4ms
[2024-10-09T07:06:52.752+0800] GC(17107) Merge Heap Roots: 0.0ms
[2024-10-09T07:06:52.752+0800] GC(17107) Evacuate Collection Set: 1.1ms
[2024-10-09T07:06:52.752+0800] GC(17107) Post Evacuate Collection Set: 1.1ms
[2024-10-09T07:06:52.752+0800] GC(17107) Other: 0.1ms
[2024-10-09T07:06:52.752+0800] GC(17107) Eden regions: 1->0(102)
[2024-10-09T07:06:52.752+0800] GC(17107) Survivor regions: 0->0(0)
[2024-10-09T07:06:52.752+0800] GC(17107) Old regions: 2009->2010
[2024-10-09T07:06:52.752+0800] GC(17107) Archive regions: 2->2
[2024-10-09T07:06:52.752+0800] GC(17107) Humongous regions: 36->36
[2024-10-09T07:06:52.752+0800] GC(17107) Metaspace: 107321K(107904K)->107321K(107904K) NonClass: 97287K(97600K)->97287K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:52.752+0800] GC(17107) Pause Young (Normal) (G1 Evacuation Pause) 8049M->8049M(8192M) 2.764ms
[2024-10-09T07:06:52.752+0800] GC(17107) User=0.00s Sys=0.00s Real=0.00s
[2024-10-09T07:06:52.752+0800] Attempting full compaction
[2024-10-09T07:06:52.752+0800] GC(17108) Using 4 workers of 4 for full compaction
[2024-10-09T07:06:52.752+0800] GC(17108) Pause Full (G1 Compaction Pause)
[2024-10-09T07:06:52.752+0800] GC(17108) Phase 1: Mark live objects
[2024-10-09T07:06:53.515+0800] GC(17108) Phase 1: Mark live objects 762.949ms
[2024-10-09T07:06:53.515+0800] GC(17108) Phase 2: Prepare for compaction
[2024-10-09T07:06:53.521+0800] GC(17108) Phase 2: Prepare for compaction 6.362ms
[2024-10-09T07:06:53.521+0800] GC(17108) Phase 3: Adjust pointers
[2024-10-09T07:06:53.573+0800] GC(17108) Phase 3: Adjust pointers 52.106ms
[2024-10-09T07:06:53.574+0800] GC(17108) Phase 4: Compact heap
[2024-10-09T07:06:53.574+0800] GC(17108) Phase 4: Compact heap 0.401ms
[2024-10-09T07:06:53.606+0800] GC(17108) Eden regions: 0->0(102)
[2024-10-09T07:06:53.606+0800] GC(17108) Survivor regions: 0->0(0)
[2024-10-09T07:06:53.606+0800] GC(17108) Old regions: 2010->2009
[2024-10-09T07:06:53.606+0800] GC(17108) Archive regions: 2->2
[2024-10-09T07:06:53.606+0800] GC(17108) Humongous regions: 36->36
[2024-10-09T07:06:53.606+0800] GC(17108) Metaspace: 107321K(107904K)->107321K(107904K) NonClass: 97287K(97600K)->97287K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:53.606+0800] GC(17108) Pause Full (G1 Compaction Pause) 8049M->8046M(8192M) 854.164ms
[2024-10-09T07:06:53.606+0800] GC(17108) User=1.66s Sys=0.00s Real=0.86s
[2024-10-09T07:06:53.608+0800] GC(17106) Concurrent Mark From Roots 1780.141ms
[2024-10-09T07:06:53.608+0800] GC(17106) Concurrent Mark Abort
[2024-10-09T07:06:53.608+0800] GC(17106) Concurrent Mark Cycle 1780.223ms
[2024-10-09T07:06:53.608+0800] GC(17109) Pause Young (Normal) (G1 Evacuation Pause)
[2024-10-09T07:06:53.608+0800] GC(17109) Using 4 workers of 4 for evacuation
[2024-10-09T07:06:53.610+0800] GC(17109) To-space exhausted
[2024-10-09T07:06:53.610+0800] GC(17109) Pre Evacuate Collection Set: 0.3ms
[2024-10-09T07:06:53.610+0800] GC(17109) Merge Heap Roots: 0.0ms
[2024-10-09T07:06:53.610+0800] GC(17109) Evacuate Collection Set: 0.9ms
[2024-10-09T07:06:53.610+0800] GC(17109) Post Evacuate Collection Set: 1.0ms
[2024-10-09T07:06:53.610+0800] GC(17109) Other: 0.1ms
[2024-10-09T07:06:53.610+0800] GC(17109) Eden regions: 1->0(102)
[2024-10-09T07:06:53.610+0800] GC(17109) Survivor regions: 0->0(0)
[2024-10-09T07:06:53.610+0800] GC(17109) Old regions: 2009->2010
[2024-10-09T07:06:53.610+0800] GC(17109) Archive regions: 2->2
[2024-10-09T07:06:53.610+0800] GC(17109) Humongous regions: 36->36
[2024-10-09T07:06:53.610+0800] GC(17109) Metaspace: 107321K(107904K)->107321K(107904K) NonClass: 97287K(97600K)->97287K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:53.610+0800] GC(17109) Pause Young (Normal) (G1 Evacuation Pause) 8049M->8049M(8192M) 2.497ms
[2024-10-09T07:06:53.610+0800] GC(17109) User=0.00s Sys=0.00s Real=0.00s
[2024-10-09T07:06:53.610+0800] Attempting full compaction
[2024-10-09T07:06:53.610+0800] GC(17110) Using 4 workers of 4 for full compaction
[2024-10-09T07:06:53.610+0800] GC(17110) Pause Full (G1 Compaction Pause)
[2024-10-09T07:06:53.611+0800] GC(17110) Phase 1: Mark live objects
[2024-10-09T07:06:54.403+0800] GC(17110) Phase 1: Mark live objects 792.711ms
[2024-10-09T07:06:54.403+0800] GC(17110) Phase 2: Prepare for compaction
[2024-10-09T07:06:54.409+0800] GC(17110) Phase 2: Prepare for compaction 5.755ms
[2024-10-09T07:06:54.409+0800] GC(17110) Phase 3: Adjust pointers
[2024-10-09T07:06:54.455+0800] GC(17110) Phase 3: Adjust pointers 45.862ms
[2024-10-09T07:06:54.455+0800] GC(17110) Phase 4: Compact heap
[2024-10-09T07:06:54.456+0800] GC(17110) Phase 4: Compact heap 1.335ms
[2024-10-09T07:06:54.744+0800] GC(17110) Eden regions: 0->0(102)
[2024-10-09T07:06:54.744+0800] GC(17110) Survivor regions: 0->0(0)
[2024-10-09T07:06:54.744+0800] GC(17110) Old regions: 2010->2009
[2024-10-09T07:06:54.744+0800] GC(17110) Archive regions: 2->2
[2024-10-09T07:06:54.744+0800] GC(17110) Humongous regions: 36->36
[2024-10-09T07:06:54.744+0800] GC(17110) Metaspace: 107321K(107904K)->107321K(107904K) NonClass: 97287K(97600K)->97287K(97600K) Class: 10033K(10304K)->10033K(10304K)
[2024-10-09T07:06:54.744+0800] GC(17110) Pause Full (G1 Compaction Pause) 8049M->8045M(8192M) 1133.942ms
[2024-10-09T07:06:54.744+0800] GC(17110) User=3.22s Sys=0.00s Real=1.13s
[2024-10-09T07:06:54.746+0800] Heap
[2024-10-09T07:06:54.746+0800] garbage-first heap total 8388608K, used 8242649K [0x0000000600000000, 0x0000000800000000)
[2024-10-09T07:06:54.746+0800] region size 4096K, 1 young (4096K), 0 survivors (0K)
[2024-10-09T07:06:54.746+0800] Metaspace used 107321K, committed 107904K, reserved 1155072K
[2024-10-09T07:06:54.746+0800] class space used 10033K, committed 10304K, reserved 1048576K

Try adjusting the FE's JVM memory.
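To see from the gc.log why heap size is the first thing to check, it can help to summarize the `Pause Full` lines: how much each full GC frees and how full the heap remains afterwards. Below is a minimal Python sketch of that. The function name `summarize_full_gcs` and the regex are my own illustration, not part of StarRocks or the JDK, and the two sample lines are copied from the log in the question.

```python
import re

# Matches JDK unified-logging summary lines such as:
#   GC(17103) Pause Full (G1 Compaction Pause) 8048M->8045M(8192M) 620.808ms
# The "start" line of a full GC has no heap figures, so it is not matched.
FULL_GC = re.compile(
    r"GC\((\d+)\) Pause Full \([^)]*\) (\d+)M->(\d+)M\((\d+)M\) ([\d.]+)ms"
)


def summarize_full_gcs(log_text):
    """Return one dict per completed full GC found in log_text."""
    rows = []
    for m in FULL_GC.finditer(log_text):
        gc_id, before, after, total, ms = m.groups()
        rows.append({
            "gc_id": int(gc_id),
            # Heap reclaimed by this full GC, in MB.
            "freed_mb": int(before) - int(after),
            # Heap occupancy right after the full GC, as a percentage.
            "used_after_pct": round(100 * int(after) / int(total), 1),
            "pause_ms": float(ms),
        })
    return rows


# Two summary lines taken from the gc.log posted above.
sample = """\
[2024-10-09T07:06:51.821+0800] GC(17103) Pause Full (G1 Compaction Pause) 8048M->8045M(8192M) 620.808ms
[2024-10-09T07:06:52.747+0800] GC(17105) Pause Full (G1 Compaction Pause) 8049M->8045M(8192M) 919.954ms
"""

for row in summarize_full_gcs(sample):
    print(row)
```

On the posted log, each full GC frees only 3-4 MB and leaves the 8192M heap about 98% occupied, which suggests the heap is filled with live objects rather than garbage. I am assuming the heap limit comes from the `-Xmx` value in `JAVA_OPTS` in `fe.conf`; please check that against your own configuration. Since the FE shares a 32G machine with a BE, any increase also has to leave enough memory for the BE.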