StarRocks 2.1.5: FE fails and cannot restart automatically

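[Details] Detailed problem description
[Background] What operations were performed?
[Business impact] Reduced cluster stability
[StarRocks version] e.g. 2.1.5
[Cluster size] e.g. 5 FE (1 follower + 2 observer) + 12 BE (FE and BE are not co-located)
[Machine info] CPU vCores/memory/NIC, e.g. 64C/256G/10 GbE
[Attachments]
fe.warn.log, full log: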
2022-06-06 15:26:31,055 WARN (UNKNOWN 10.104.7.55_9010_1654499934397(-1)|1) [Catalog.notifyNewFETypeTransfer():2362] notify new FE type transfer: UNKNOWN
2022-06-06 15:26:31,069 WARN (RepNode 10.104.7.55_9010_1654499934397(-1)|65) [Catalog.notifyNewFETypeTransfer():2362] notify new FE type transfer: OBSERVER
2022-06-06 15:26:31,062 WARN (UNKNOWN 10.104.7.55_9010_1654499934397(-1)|1) [BDBEnvironment.setup():218] database exception
com.sleepycat.je.DatabaseNotFoundException: (JE 7.3.8) Database epochDB not found.
at com.sleepycat.je.Environment.setupDatabase(Environment.java:838) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.Environment.openDatabase(Environment.java:652) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:208) [starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.open(BDBJEJournal.java:271) [starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.open(EditLog.java:843) [starrocks-fe.jar:?]
at com.starrocks.catalog.Catalog.initialize(Catalog.java:871) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:107) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:62) [starrocks-fe.jar:?]
2022-06-06 15:26:31,095 WARN (REPLICA 10.104.7.55_9010_1654499934397(2147483646)|65) [BDBStateChangeListener.stateChange():61] this node is DETACHED
2022-06-06 15:26:36,077 WARN (UNKNOWN 10.104.7.55_9010_1654499934397(-1)|1) [BDBEnvironment.setup():211] insufficient exception, refresh and setup again
com.sleepycat.je.rep.InsufficientLogException: (JE 7.3.8) Environment must be closed, caused by: com.sleepycat.je.rep.InsufficientLogException: Environment invalid because of previous exception: (JE 7.3.8) 10.104.7.55_9010_1654499934397(2147483646):/opt/doris/StarRocks-2.1.5/fe/meta/bdb INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 10.104.7.55_9010_1654499934397(2147483646) Originally thrown by HA thread: REPLICA 10.104.7.55_9010_1654499934397(2147483646)refreshVLSN=1,003,536,774 logProviders=[Node:10.104.7.145_9010_1614221836305 10.104.7.145:9010 (is member) SECONDARY changeVersion:-1 LocalCBVLSN:1,003,551,499 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.146_9010_1614221836139 10.104.7.146:9010 (is member) changeVersion:6 LocalCBVLSN:1,003,551,499 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.131_9010_1614221836020 10.104.7.131:9010 (is member) changeVersion:5 LocalCBVLSN:1,003,551,497 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.41_9010_1614221791093 10.104.7.41:9010 (is member) changeVersion:4 LocalCBVLSN:1,003,551,523 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.55_9010_1654499934397 10.104.7.55:9010 (is member) SECONDARY changeVersion:-1 LocalCBVLSN:-1 at:Mon Jun 06 15:26:31 CST 2022 jeVersion:7.3.8
] repImpl=com.sleepycat.je.rep.impl.RepImpl@62e12438 props={GROUP_NAME=PALO_JOURNAL_GROUP, REFRESH_VLSN=1003536774, NODE_NAME=10.104.7.55_9010_1654499934397, HOSTNAME=10.104.7.55, P_NODETYPE4=SECONDARY, P_NODETYPE3=ELECTABLE, P_NODETYPE2=ELECTABLE, P_NODENAME4=10.104.7.55_9010_1654499934397, P_NODETYPE1=ELECTABLE, P_HOSTNAME4=10.104.7.55, P_NODENAME3=10.104.7.41_9010_1614221791093, P_NODETYPE0=SECONDARY, P_HOSTNAME3=10.104.7.41, P_NODENAME2=10.104.7.131_9010_1614221836020, P_HOSTNAME2=10.104.7.131, P_NODENAME1=10.104.7.146_9010_1614221836139, P_HOSTNAME1=10.104.7.146, P_NODENAME0=10.104.7.145_9010_1614221836305, PORT=9010, P_HOSTNAME0=10.104.7.145, P_NUMPROVIDERS=5, P_PORT4=9010, P_PORT3=9010, ENV_DIR=/opt/doris/StarRocks-2.1.5/fe/meta/bdb, P_PORT2=9010, P_PORT1=9010, P_PORT0=9010}
at com.sleepycat.je.rep.InsufficientLogException.wrapSelf(InsufficientLogException.java:315) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:151) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:267) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.Environment.<init>(Environment.java:252) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:607) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:466) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:540) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:178) [starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.open(BDBJEJournal.java:271) [starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.open(EditLog.java:843) [starrocks-fe.jar:?]
at com.starrocks.catalog.Catalog.initialize(Catalog.java:871) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:107) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:62) [starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.InsufficientLogException: Environment invalid because of previous exception: (JE 7.3.8) 10.104.7.55_9010_1654499934397(2147483646):/opt/doris/StarRocks-2.1.5/fe/meta/bdb INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA 10.104.7.55_9010_1654499934397(2147483646) Originally thrown by HA thread: REPLICA 10.104.7.55_9010_1654499934397(2147483646)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupLogRefresh(ReplicaFeederSyncup.java:664) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.getFeederRecord(ReplicaFeederSyncup.java:732) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.findMatchpoint(ReplicaFeederSyncup.java:406) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:151) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:711) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:474) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:409) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1873) ~[starrocks-bdb-je-7.3.8.jar:?]
2022-06-06 15:26:36,113 ERROR (UNKNOWN 10.104.7.55_9010_1654499934397(-1)|1) [BDBJEJournal.open():273] catch an exception when setup bdb environment. will exit.
com.sleepycat.je.EnvironmentFailureException: (JE 7.3.8) Tried and failed with every node UNEXPECTED_STATE: Unexpected internal state, may have side effects.
at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:428) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.sleepycat.je.rep.NetworkRestore.execute(NetworkRestore.java:321) ~[starrocks-bdb-je-7.3.8.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.refreshLog(BDBEnvironment.java:241) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:212) ~[starrocks-fe.jar:?]
at com.starrocks.journal.bdbje.BDBJEJournal.open(BDBJEJournal.java:271) [starrocks-fe.jar:?]
at com.starrocks.persist.EditLog.open(EditLog.java:843) [starrocks-fe.jar:?]
at com.starrocks.catalog.Catalog.initialize(Catalog.java:871) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.start(StarRocksFE.java:107) [starrocks-fe.jar:?]
at com.starrocks.StarRocksFE.main(StarRocksFE.java:62) [starrocks-fe.jar:?]
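
The InsufficientLogException message in the log lists a refreshVLSN and a LocalCBVLSN for each log provider. As a rough aid for reading that line, here is a minimal Python sketch that pulls those numbers out so they can be compared side by side. The regular expressions and the helper names (`parse_refresh_vlsn`, `parse_providers`) are illustrative assumptions based only on the text pasted in this thread; they are not part of StarRocks or BDB JE.

```python
import re

# Excerpt of the exception text from fe.warn.log, copied from the log in this thread.
MESSAGE = """refreshVLSN=1,003,536,774 logProviders=[Node:10.104.7.145_9010_1614221836305 10.104.7.145:9010 (is member) SECONDARY changeVersion:-1 LocalCBVLSN:1,003,551,499 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.146_9010_1614221836139 10.104.7.146:9010 (is member) changeVersion:6 LocalCBVLSN:1,003,551,499 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.131_9010_1614221836020 10.104.7.131:9010 (is member) changeVersion:5 LocalCBVLSN:1,003,551,497 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.41_9010_1614221791093 10.104.7.41:9010 (is member) changeVersion:4 LocalCBVLSN:1,003,551,523 at:Mon Jun 06 15:26:30 CST 2022 jeVersion:7.3.8
, Node:10.104.7.55_9010_1654499934397 10.104.7.55:9010 (is member) SECONDARY changeVersion:-1 LocalCBVLSN:-1 at:Mon Jun 06 15:26:31 CST 2022 jeVersion:7.3.8
]"""

# Pattern assumed from the log format shown above; other JE versions may print it differently.
NODE_RE = re.compile(
    r"Node:(\S+) (\S+) \(is member\)( SECONDARY)? "
    r"changeVersion:-?\d+ LocalCBVLSN:(-?[\d,]+)"
)


def to_int(text):
    """Convert a number printed with thousands separators, e.g. '1,003,551,499'."""
    return int(text.replace(",", ""))


def parse_refresh_vlsn(message):
    """Return the refreshVLSN value from the message, or None if it is absent."""
    match = re.search(r"refreshVLSN=([\d,]+)", message)
    return to_int(match.group(1)) if match else None


def parse_providers(message):
    """Return one dict per log provider listed in the message."""
    providers = []
    for name, address, secondary, vlsn in NODE_RE.findall(message):
        providers.append({
            "name": name,
            "address": address,
            "secondary": bool(secondary),
            "local_cbvlsn": to_int(vlsn),
        })
    return providers


refresh = parse_refresh_vlsn(MESSAGE)
print("refreshVLSN:", refresh)
for provider in parse_providers(MESSAGE):
    role = "SECONDARY" if provider["secondary"] else "ELECTABLE"
    print(provider["address"], role, "LocalCBVLSN:", provider["local_cbvlsn"])
```

Read this way, the four existing members report LocalCBVLSN values within a few dozen entries of each other, while the rejoining node 10.104.7.55 reports -1, which is consistent with its meta directory having been emptied.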

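I tried removing this FE node from the cluster, deleting the meta directory, and rejoining the cluster with the --helper option. That failed with the same error.

I also tried rolling back to 2.1.4 and to 2.0.1 and restarting. That failed too, with the same error.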

Could you add ignore_unknown_log_id=true to fe.conf and then try starting the FE?

That doesn't work either; same error.

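"I tried removing this FE node from the cluster, deleting the meta directory, and rejoining the cluster with the --helper option. That failed with the same error."

How exactly did you do this? Did you first run ALTER SYSTEM to drop the FE and then add it back, then clear the meta folder, and then join the cluster with --helper?

Does your machine have multiple NICs? Is a network parameter such as priority_networks set in fe.conf to pin the IP?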

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#####################################################################
# The uppercase properties are read and exported by bin/start_fe.sh.
# To see all Frontend configurations,
# see fe/src/org/apache/doris/common/Config.java
#####################################################################

# the output dir of stderr and stdout
#LOG_DIR = ${DORIS_HOME}/log
LOG_DIR = /var/log/doris

DATE = `date +%Y%m%d-%H%M%S`
JAVA_OPTS= -Xmx64g -XX:+UseMembar -XX:+UseG1GC -XX:G1HeapRegionSize=32m -XX:MaxGCPauseMillis=200 -XX:ConcGCThreads=16 -XX:ParallelGCThreads=40 -XX:InitiatingHeapOccupancyPercent=45 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:/var/log/doris/fe.gc.log.$DATE

# For jdk 9+, this JAVA_OPTS will be used as default JVM options
#JAVA_OPTS_FOR_JDK_9="-Xmx4096m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$DORIS_HOME/log/fe.gc.log.$DATE:time"

# the lowercase properties are read by main program.

# INFO, WARN, ERROR, FATAL

sys_log_level = INFO

# store metadata, create it if it is not exist.
# Default value is ${DORIS_HOME}/meta

meta_dir = ${DORIS_HOME}/meta

http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010
mysql_service_nio_enabled = true

# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.

priority_networks = 10.10.10.0/24;192.168.0.0/16

# Advanced configurations

log_roll_size_mb = 1024
sys_log_dir = /var/log/doris

sys_log_roll_num = 10

sys_log_verbose_modules =

audit_log_dir = /var/log/doris
audit_log_modules = slow_query, query
audit_log_roll_num = 10
meta_delay_toleration_second = 30
qe_max_connection = 1024
max_conn_per_user = 100
qe_query_timeout_second = 300
qe_slow_log_ms = 5000

max_bytes_per_broker_scanner = 322122547200
max_broker_concurrency = 1024

This is the FE conf file. The # characters seem to have been parsed as bold formatting when I pasted it.

ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet 10.104.7.55 netmask 255.255.255.128 broadcast 10.104.7.127
inet6 fe80::ea61:1fff:fe34:a772 prefixlen 64 scopeid 0x20<link>
ether e8:61:1f:34:a7:72 txqueuelen 1000 (Ethernet)
RX packets 14873656606 bytes 10275719977564 (9.3 TiB)
RX errors 0 dropped 61 overruns 0 frame 0
TX packets 36247937775 bytes 46882375076720 (42.6 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp65s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether e8:61:1f:34:a7:72 txqueuelen 1000 (Ethernet)
RX packets 7813057162 bytes 5244574860858 (4.7 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 17543015292 bytes 22796411048685 (20.7 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp65s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether e8:61:1f:34:a7:72 txqueuelen 1000 (Ethernet)
RX packets 7060599448 bytes 5031145116970 (4.5 TiB)
RX errors 0 dropped 61 overruns 0 frame 0
TX packets 18704922488 bytes 24085964028977 (21.9 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 273651461 bytes 38047175251 (35.4 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 273651461 bytes 38047175251 (35.4 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

This is the ifconfig output.
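
To make the multi-NIC question concrete, the following minimal Python sketch checks whether the bond0 address from the ifconfig output falls inside any network listed in priority_networks. It assumes plain CIDR containment, which is how the comments in fe.conf describe the rule; it is not the FE's actual selection code, and the helper name `matching_networks` is hypothetical. Since the # markers were lost in the pasted fe.conf, the priority_networks line may in fact be commented out on the real machine.

```python
import ipaddress


def matching_networks(ip, priority_networks):
    """Return the entries of a ';'-separated CIDR list that contain the given IP."""
    address = ipaddress.ip_address(ip)
    entries = [item.strip() for item in priority_networks.split(";") if item.strip()]
    return [
        entry for entry in entries
        # strict=False tolerates entries whose host bits are set, e.g. 10.104.7.55/25
        if address in ipaddress.ip_network(entry, strict=False)
    ]


BOND0_IP = "10.104.7.55"                            # from the ifconfig output
PRIORITY_NETWORKS = "10.10.10.0/24;192.168.0.0/16"  # as pasted in fe.conf

print(matching_networks(BOND0_IP, PRIORITY_NETWORKS))  # prints []
print(matching_networks(BOND0_IP, "10.104.7.0/25"))    # prints ['10.104.7.0/25']
```

With the value as pasted, no listed network contains 10.104.7.55. The second call uses 10.104.7.0/25, which is the subnet implied by the bond0 netmask 255.255.255.128, to show what a matching entry would look like. bond0 is the only non-loopback interface that carries an address here.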

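  1. drop fe
  2. clear the meta directory
  3. start with --helper
  4. add fe
    After starting with --helper the process stays up, but it crashes as soon as the node is added on the master with ALTER SYSTEM ADD.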

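On a follower, what is the newest file under the meta/image directory? And how large is the bdb directory?

On the other nodes it is 11 GB. On the faulty node it is now 110 KB, because I deleted its meta directory.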

com.sleepycat.je.DatabaseNotFoundException: (JE 7.3.8) Database epochDB not found.
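This log line looks odd to me. I have never created a database named epochDB. I also restarted another FE node today, and that node's restart log does not contain this line.

What are the timestamps of the files under meta/image?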

image.499269132
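This is from a healthy node.

image.499214710
This is from the faulty node. The faulty node is not trying to start right now, so I think the gap is expected.

I mean the file creation time, not the file name.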

[image]
This is the faulty node.

[image]
This is a healthy node.

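First remove the faulty node from the cluster (drop it). Then wait until the bdb directory on the other nodes has shrunk to under 1 GB, and try adding it again. The cause right now is that one node is down, so the data in bdb is not being cleaned up.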
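
One way to follow the "wait until bdb is under 1 GB" advice is to measure the directory on each healthy FE before re-adding the node. The Python sketch below is a minimal, assumption-laden helper: the function names are hypothetical, the 1 GiB threshold is simply the figure quoted in the advice, and the bdb path must be adjusted per node (the faulty node's log shows /opt/doris/StarRocks-2.1.5/fe/meta/bdb, and the other nodes may use a different path).

```python
import os

ONE_GIB = 1024 ** 3  # threshold taken from the "under 1 GB" advice above


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if os.path.isfile(file_path) and not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Log files may be deleted by cleanup while we are walking.
                continue
    return total


def bdb_small_enough(bdb_dir, limit_bytes=ONE_GIB):
    """Return True when the bdb directory is below the given size limit."""
    return dir_size_bytes(bdb_dir) < limit_bytes
```

For example, running `bdb_small_enough("/path/to/fe/meta/bdb")` on each healthy FE and re-adding the dropped node only once every call returns True would match the advice. `du -sh` on the same directory gives the same information interactively.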