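【Cluster Info】
- Version: 2.0.8
- FE nodes: 3
- BE nodes: 3
【Problem】
1. Shortly after 1 a.m. I received an alert that none of the FE nodes could be reached. When I tried connecting with a client myself, the connection just hung.
2. The FE log was still being written normally. Most entries were INFO level, with only one ERROR: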
2022-08-19 01:44:14,585 ERROR (starrocks-mysql-nio-pool-1509|59745) [AcceptListener.lambda$handleEvent$1():88] connect processor exception because
2022-08-19 01:59:53,565 INFO (starrocks-mysql-nio I/O-4|84) [AcceptListener.handleEvent():54] Connection established. remote=/192.168.1.xx:36356
2022-08-19 01:59:53,565 INFO (LoadLabelCleaner|70) [DatabaseTransactionMgr.removeExpiredTxns():1184] transaction [28421350] is expired, remove it from transaction manager
2022-08-19 01:59:53,565 INFO (starrocks-mysql-nio-pool-1515|59751) [MysqlChannel.fetchOnePacket():152] Receive packet header failed, remote 192.168.1.xx:55941 may close the channel.
2022-08-19 01:59:53,565 INFO (LoadLabelCleaner|70) [DatabaseTransactionMgr.removeExpiredTxns():1184] transaction [28421359] is expired, remove it from transaction manager
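3. Monitoring showed that JVM memory on one of the FE nodes suddenly spiked, after which no more metrics could be collected from it.
4. After restarting that FE node, the other two FE nodes became reachable again.
【Questions】
1. I suspect a CMS Full GC caused a stop-the-world (STW) pause that froze the cluster, but Kafka consumption looked normal when I checked.
2. Can JVM garbage collection on one FE node make the other two FE nodes unreachable as well?
【GC Log】
GC log from the FE node for the period when it was hung: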
2022-08-19T01:03:42.763+0800: 141643.052: [Full GC (Allocation Failure) 2022-08-19T01:03:42.763+0800: 141643.053: [CMS: 7281088K->7281087K(7281088K), 17.8900219 secs] 8277887K->7477598K(8277888K), [Metaspace: 68836K->68836K(1114112K)], 17.8901450 secs] [Times: user=17.89 sys=0.00, real=17.89 secs]
2022-08-19T01:04:00.654+0800: 141660.943: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7494834K(8277888K), 0.0939809 secs] [Times: user=0.10 sys=0.00, real=0.09 secs]
2022-08-19T01:04:00.748+0800: 141661.037: [CMS-concurrent-mark-start]
2022-08-19T01:04:01.562+0800: 141661.851: [Full GC (Allocation Failure) 2022-08-19T01:04:01.562+0800: 141661.851: [CMS2022-08-19T01:04:03.886+0800: 141664.176: [CMS-concurrent-mark: 3.135/3.138 secs] [Times: user=13.36 sys=0.00, real=3.14 secs]
2022-08-19T01:04:22.483+0800: 141682.773: [Full GC (Allocation Failure) 2022-08-19T01:04:22.483+0800: 141682.773: [CMS: 7281087K->7281088K(7281088K), 17.9024548 secs] 8277887K->7544253K(8277888K), [Metaspace: 68837K->68837K(1114112K)], 17.9025585 secs] [Times: user=17.91 sys=0.00, real=17.90 secs]
2022-08-19T01:04:40.387+0800: 141700.676: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281088K(7281088K)] 7560388K(8277888K), 0.1251884 secs] [Times: user=0.13 sys=0.00, real=0.12 secs]
2022-08-19T01:04:40.512+0800: 141700.801: [CMS-concurrent-mark-start]
2022-08-19T01:04:41.302+0800: 141701.592: [Full GC (Allocation Failure) 2022-08-19T01:04:41.302+0800: 141701.592: [CMS2022-08-19T01:04:43.650+0800: 141703.940: [CMS-concurrent-mark: 3.135/3.138 secs] [Times: user=13.34 sys=0.00, real=3.14 secs]
2022-08-19T01:05:02.669+0800: 141722.959: [Full GC (Allocation Failure) 2022-08-19T01:05:02.669+0800: 141722.959: [CMS: 7281087K->7281087K(7281088K), 18.4912923 secs] 8277887K->7591278K(8277888K), [Metaspace: 68837K->68837K(1114112K)], 18.4913996 secs] [Times: user=18.50 sys=0.00, real=18.49 secs]
2022-08-19T01:05:21.161+0800: 141741.451: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7606209K(8277888K), 0.1657490 secs] [Times: user=0.17 sys=0.00, real=0.17 secs]
2022-08-19T01:05:21.327+0800: 141741.617: [CMS-concurrent-mark-start]
2022-08-19T01:05:22.113+0800: 141742.403: [Full GC (Allocation Failure) 2022-08-19T01:05:22.113+0800: 141742.403: [CMS2022-08-19T01:05:24.610+0800: 141744.900: [CMS-concurrent-mark: 3.281/3.283 secs] [Times: user=13.92 sys=0.00, real=3.28 secs]
2022-08-19T01:05:41.594+0800: 141761.884: [Full GC (Allocation Failure) 2022-08-19T01:05:41.594+0800: 141761.884: [CMS: 7281087K->7281087K(7281088K), 18.0139144 secs] 8277887K->7666618K(8277888K), [Metaspace: 68833K->68833K(1114112K)], 18.0140193 secs] [Times: user=18.02 sys=0.00, real=18.01 secs]
2022-08-19T01:05:59.609+0800: 141779.899: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7680466K(8277888K), 0.2987519 secs] [Times: user=0.30 sys=0.00, real=0.30 secs]
2022-08-19T01:05:59.908+0800: 141780.197: [CMS-concurrent-mark-start]
2022-08-19T01:06:00.625+0800: 141780.915: [Full GC (Allocation Failure) 2022-08-19T01:06:00.625+0800: 141780.915: [CMS2022-08-19T01:06:03.072+0800: 141783.361: [CMS-concurrent-mark: 3.160/3.164 secs] [Times: user=13.38 sys=0.00, real=3.16 secs]
2022-08-19T01:06:21.920+0800: 141802.209: [Full GC (Allocation Failure) 2022-08-19T01:06:21.920+0800: 141802.209: [CMS: 7281087K->7281087K(7281088K), 18.5705534 secs] 8277887K->7704627K(8277888K), [Metaspace: 68833K->68833K(1114112K)], 18.5706812 secs] [Times: user=18.57 sys=0.00, real=18.57 secs]
2022-08-19T01:06:40.491+0800: 141820.780: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7717231K(8277888K), 0.2050634 secs] [Times: user=0.21 sys=0.00, real=0.21 secs]
2022-08-19T01:06:40.696+0800: 141820.986: [CMS-concurrent-mark-start]
2022-08-19T01:06:41.417+0800: 141821.706: [Full GC (Allocation Failure) 2022-08-19T01:06:41.417+0800: 141821.706: [CMS2022-08-19T01:06:43.875+0800: 141824.165: [CMS-concurrent-mark: 3.177/3.179 secs] [Times: user=13.44 sys=0.01, real=3.18 secs]
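
To make the GC log above easier to read, here is a minimal Python sketch that pulls the pause time and the old-generation numbers out of each completed CMS Full GC record. It is only an illustration, not a StarRocks tool: the function name summarize_full_gcs, the regular expression, and the SAMPLE string (three fragments copied from the log above, trimmed to the part the regex needs) are my own assumptions.

```python
import re

# Matches the body of a completed CMS Full GC record, for example:
# [CMS: 7281088K->7281087K(7281088K), 17.8900219 secs] 8277887K->7477598K(8277888K)
FULL_GC = re.compile(
    r"\[CMS: (?P<old_before>\d+)K->(?P<old_after>\d+)K\((?P<old_cap>\d+)K\), "
    r"(?P<pause>[\d.]+) secs\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\)"
)


def summarize_full_gcs(log_text):
    """Return one dict per completed Full GC found in log_text (hypothetical helper)."""
    rows = []
    for m in FULL_GC.finditer(log_text):
        old_after = int(m["old_after"])
        old_cap = int(m["old_cap"])
        rows.append({
            "pause_secs": float(m["pause"]),
            # How much the old generation shrank; near zero means nothing was freed.
            "old_reclaimed_kb": int(m["old_before"]) - old_after,
            "old_used_pct": 100.0 * old_after / old_cap,
            "heap_after_kb": int(m["heap_after"]),
        })
    return rows


# Fragments copied from the GC log in this post (timestamps and trailing fields trimmed).
SAMPLE = "\n".join([
    "[CMS: 7281088K->7281087K(7281088K), 17.8900219 secs] 8277887K->7477598K(8277888K)",
    "[CMS: 7281087K->7281088K(7281088K), 17.9024548 secs] 8277887K->7544253K(8277888K)",
    "[CMS: 7281087K->7281087K(7281088K), 18.4912923 secs] 8277887K->7591278K(8277888K)",
])

if __name__ == "__main__":
    for row in summarize_full_gcs(SAMPLE):
        print(row)
```

On these fragments, every completed Full GC pauses for roughly 18 seconds, the old generation stays at essentially 100% of its 7281088K capacity, and the heap size after collection keeps growing. That pattern fits the STW suspicion for this one FE node, but by itself it does not say what is holding the memory or why the other two FE nodes were affected.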