FE nodes suddenly became unreachable

【Cluster】

  • Version: 2.0.8
  • FE nodes: 3
  • BE nodes: 3

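【Problem】

1. Shortly after 1 a.m. we received an alert that none of the FE nodes could be reached. When I tried connecting with a client myself, the connection just hung.

2. The FE log was still being written normally. Almost all entries were INFO level, with only one ERROR: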

2022-08-19 01:44:14,585 ERROR (starrocks-mysql-nio-pool-1509|59745) [AcceptListener.lambda$handleEvent$1():88] connect processor exception because 
2022-08-19 01:59:53,565 INFO (starrocks-mysql-nio I/O-4|84) [AcceptListener.handleEvent():54] Connection established. remote=/192.168.1.xx:36356
2022-08-19 01:59:53,565 INFO (LoadLabelCleaner|70) [DatabaseTransactionMgr.removeExpiredTxns():1184] transaction [28421350] is expired, remove it from transaction manager
2022-08-19 01:59:53,565 INFO (starrocks-mysql-nio-pool-1515|59751) [MysqlChannel.fetchOnePacket():152] Receive packet header failed, remote 192.168.1.xx:55941 may close the channel.
2022-08-19 01:59:53,565 INFO (LoadLabelCleaner|70) [DatabaseTransactionMgr.removeExpiredTxns():1184] transaction [28421359] is expired, remove it from transaction manager

3. Monitoring showed that JVM memory on one of the nodes suddenly spiked, after which no more metrics could be collected from it.

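4. After restarting that FE node, the other two FE nodes became reachable again.

【Questions】

1. I suspect that a stop-the-world (STW) pause caused by CMS Full GC froze the cluster, but Kafka consumption looked normal when I checked.

2. Can JVM garbage collection on one FE node make the other two FE nodes unreachable as well?

【GC Log】

GC log from the period when the FE was hung: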

2022-08-19T01:03:42.763+0800: 141643.052: [Full GC (Allocation Failure) 2022-08-19T01:03:42.763+0800: 141643.053: [CMS: 7281088K->7281087K(7281088K), 17.8900219 secs] 8277887K->7477598K(8277888K), [Metaspace: 68836K->68836K(1114112K)], 17.8901450 secs] [Times: user=17.89 sys=0.00, real=17.89 secs]
2022-08-19T01:04:00.654+0800: 141660.943: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7494834K(8277888K), 0.0939809 secs] [Times: user=0.10 sys=0.00, real=0.09 secs]
2022-08-19T01:04:00.748+0800: 141661.037: [CMS-concurrent-mark-start]
2022-08-19T01:04:01.562+0800: 141661.851: [Full GC (Allocation Failure) 2022-08-19T01:04:01.562+0800: 141661.851: [CMS2022-08-19T01:04:03.886+0800: 141664.176: [CMS-concurrent-mark: 3.135/3.138 secs] [Times: user=13.36 sys=0.00, real=3.14 secs]
2022-08-19T01:04:22.483+0800: 141682.773: [Full GC (Allocation Failure) 2022-08-19T01:04:22.483+0800: 141682.773: [CMS: 7281087K->7281088K(7281088K), 17.9024548 secs] 8277887K->7544253K(8277888K), [Metaspace: 68837K->68837K(1114112K)], 17.9025585 secs] [Times: user=17.91 sys=0.00, real=17.90 secs]
2022-08-19T01:04:40.387+0800: 141700.676: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281088K(7281088K)] 7560388K(8277888K), 0.1251884 secs] [Times: user=0.13 sys=0.00, real=0.12 secs]
2022-08-19T01:04:40.512+0800: 141700.801: [CMS-concurrent-mark-start]
2022-08-19T01:04:41.302+0800: 141701.592: [Full GC (Allocation Failure) 2022-08-19T01:04:41.302+0800: 141701.592: [CMS2022-08-19T01:04:43.650+0800: 141703.940: [CMS-concurrent-mark: 3.135/3.138 secs] [Times: user=13.34 sys=0.00, real=3.14 secs]
2022-08-19T01:05:02.669+0800: 141722.959: [Full GC (Allocation Failure) 2022-08-19T01:05:02.669+0800: 141722.959: [CMS: 7281087K->7281087K(7281088K), 18.4912923 secs] 8277887K->7591278K(8277888K), [Metaspace: 68837K->68837K(1114112K)], 18.4913996 secs] [Times: user=18.50 sys=0.00, real=18.49 secs]
2022-08-19T01:05:21.161+0800: 141741.451: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7606209K(8277888K), 0.1657490 secs] [Times: user=0.17 sys=0.00, real=0.17 secs]
2022-08-19T01:05:21.327+0800: 141741.617: [CMS-concurrent-mark-start]
2022-08-19T01:05:22.113+0800: 141742.403: [Full GC (Allocation Failure) 2022-08-19T01:05:22.113+0800: 141742.403: [CMS2022-08-19T01:05:24.610+0800: 141744.900: [CMS-concurrent-mark: 3.281/3.283 secs] [Times: user=13.92 sys=0.00, real=3.28 secs]
2022-08-19T01:05:41.594+0800: 141761.884: [Full GC (Allocation Failure) 2022-08-19T01:05:41.594+0800: 141761.884: [CMS: 7281087K->7281087K(7281088K), 18.0139144 secs] 8277887K->7666618K(8277888K), [Metaspace: 68833K->68833K(1114112K)], 18.0140193 secs] [Times: user=18.02 sys=0.00, real=18.01 secs]
2022-08-19T01:05:59.609+0800: 141779.899: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7680466K(8277888K), 0.2987519 secs] [Times: user=0.30 sys=0.00, real=0.30 secs]
2022-08-19T01:05:59.908+0800: 141780.197: [CMS-concurrent-mark-start]
2022-08-19T01:06:00.625+0800: 141780.915: [Full GC (Allocation Failure) 2022-08-19T01:06:00.625+0800: 141780.915: [CMS2022-08-19T01:06:03.072+0800: 141783.361: [CMS-concurrent-mark: 3.160/3.164 secs] [Times: user=13.38 sys=0.00, real=3.16 secs]
2022-08-19T01:06:21.920+0800: 141802.209: [Full GC (Allocation Failure) 2022-08-19T01:06:21.920+0800: 141802.209: [CMS: 7281087K->7281087K(7281088K), 18.5705534 secs] 8277887K->7704627K(8277888K), [Metaspace: 68833K->68833K(1114112K)], 18.5706812 secs] [Times: user=18.57 sys=0.00, real=18.57 secs]
2022-08-19T01:06:40.491+0800: 141820.780: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7281087K(7281088K)] 7717231K(8277888K), 0.2050634 secs] [Times: user=0.21 sys=0.00, real=0.21 secs]
2022-08-19T01:06:40.696+0800: 141820.986: [CMS-concurrent-mark-start]
2022-08-19T01:06:41.417+0800: 141821.706: [Full GC (Allocation Failure) 2022-08-19T01:06:41.417+0800: 141821.706: [CMS2022-08-19T01:06:43.875+0800: 141824.165: [CMS-concurrent-mark: 3.177/3.179 secs] [Times: user=13.44 sys=0.01, real=3.18 secs]
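
In the log above, each completed Full GC pauses for roughly 18 seconds and still leaves the CMS old generation at its 7281088K capacity. A minimal Python sketch for pulling those numbers out of such a log follows. The function name `summarize_full_gcs` and the dictionary keys are made up for illustration, and the regular expression only assumes the line layout seen in this particular log, not every JVM GC log format.

```python
import re

# Matches a completed CMS Full GC entry as it appears in the log above:
# old gen before->after(capacity), pause in seconds, then whole heap before->after(capacity).
FULL_GC = re.compile(
    r"\[CMS: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\] "
    r"(\d+)K->(\d+)K\((\d+)K\)"
)


def summarize_full_gcs(log_text):
    """Return one dict per completed CMS Full GC entry found in log_text."""
    events = []
    for match in FULL_GC.finditer(log_text):
        _old_before, old_after, old_cap, pause, heap_before, heap_after, _heap_cap = match.groups()
        events.append({
            "pause_secs": float(pause),
            # Fraction of the old generation still occupied after the collection.
            "old_gen_used_after": int(old_after) / int(old_cap),
            # Memory actually reclaimed from the whole heap, in KB.
            "heap_freed_kb": int(heap_before) - int(heap_after),
        })
    return events


# First Full GC entry from the log above, joined onto one line.
sample = (
    "2022-08-19T01:03:42.763+0800: 141643.052: [Full GC (Allocation Failure) "
    "2022-08-19T01:03:42.763+0800: 141643.053: [CMS: 7281088K->7281087K(7281088K), "
    "17.8900219 secs] 8277887K->7477598K(8277888K), [Metaspace: 68836K->68836K(1114112K)], "
    "17.8901450 secs] [Times: user=17.89 sys=0.00, real=17.89 secs]"
)

for event in summarize_full_gcs(sample):
    print(
        f"pause={event['pause_secs']:.1f}s "
        f"old_gen_after={event['old_gen_used_after']:.1%} "
        f"freed={event['heap_freed_kb']}K"
    )
```

Run over the whole GC log, a summary like this makes it easy to see whether the old generation ever drops below capacity between pauses.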

Hello, what are the specs of your FE nodes?
How many GB of heap does JAVA_OPTS in fe.conf configure?

It is set to 8 GB. The machine has 64 GB in total, and because the FE shares it with a BE, I did not dare to set it any higher.

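Can you find out which SQL statements or load jobs were running at the time of the failure?

Judging from the logs, the preliminary diagnosis is Full GC. There are two options:

  1. Have the SQL client connect to multiple FEs so that the load is balanced across them;
  2. Increase the JVM heap in fe.conf from 8 GB to 12 GB (more memory reduces the impact of Full GC).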
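
As a rough illustration of the first option, here is a minimal Python sketch of client-side rotation across FEs. It assumes the FE sends a MySQL-protocol handshake as soon as a client connects, so a node stuck in a long GC pause is detected by the read timing out rather than by the TCP connect failing. The host names are placeholders, 9030 is assumed to be the FE query port, and `is_responsive` and `make_picker` are hypothetical helpers, not part of any StarRocks client.

```python
import itertools
import socket

# Placeholder FE addresses; replace with the real hosts and the FE query port.
FE_ENDPOINTS = [
    ("fe1.example.internal", 9030),
    ("fe2.example.internal", 9030),
    ("fe3.example.internal", 9030),
]


def is_responsive(host, port, timeout=2.0):
    """Return True if the server sends any bytes within `timeout` seconds.

    A MySQL-protocol server speaks first, so a process that accepts the TCP
    connection but cannot run (for example during a long GC pause) times out here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return len(conn.recv(4)) > 0
    except OSError:  # covers refused connections and timeouts
        return False


def make_picker(endpoints, probe=is_responsive):
    """Return a function that hands out the next responsive endpoint, round-robin."""
    cycle = itertools.cycle(endpoints)

    def pick():
        # Try each endpoint at most once per call, continuing where the last call stopped.
        for _ in range(len(endpoints)):
            host, port = next(cycle)
            if probe(host, port):
                return host, port
        raise ConnectionError("no FE endpoint is responsive")

    return pick
```

A client would call `pick = make_picker(FE_ENDPOINTS)` once and then `pick()` before opening each new connection. This only spreads connections and skips hung nodes; it does not help if every FE is unresponsive at the same time.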

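There is a scheduled job that writes up to a few hundred thousand rows, and the SQL memory limit is set to 10 GB.

There is already an ELB in front doing load balancing, but it seems that traffic only reaches one of the FE nodes.

Besides, all FE nodes were already unreachable and hung at that point, so load balancing would not have helped, would it? I did not expect one hung FE node to affect the others.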

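Possibly because traffic only went to one FE node, that node came under too much pressure and went into Full GC.
Please check whether the failed node is the LEADER.
Load balancing works best when requests are distributed evenly across all three FE nodes.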