【FE】升级2.5.21后fe启动失败

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】SR版本升级 2.4.5 -> 2.5.21, 升级后只有9010端口启动了,其他端口都没启动;目前集群已回滚到2.4.5
【背景】集群升级
【业务影响】SR无法访问
【是否存算分离】否
【StarRocks版本】2.5.21
【集群规模】3fe(1 follower+2observer)+10be(fe与be分离)
【机器信息】CPU虚拟核/内存/网卡,32C/128G/万兆
【联系方式】为了在解决问题过程中能及时联系到您获取一些日志信息,请补充下您的联系方式,社区群12-金谡 jinsu@moojing.com【附件】

升级操作过程:

  1. 滚动升级be, 所有be替换2.5.21版本后,正常启动
  2. 升级fe follower, 2个fe 替换2.5.21版本后,启动fe未报错,但只有9010端口正常启动
  3. 升级fe leader, fe 替换2.5.21版本后,启动fe未报错,但只有9010端口正常启动

fe日志中, 从23:49开始启动fe, 到23:59左右fe日志中只有少量重复信息,一直等到00:30 fe的cpu基本为0,日志也没有新的信息,仍然只有9010端口,停止升级,回滚集群到2.4.5

meta目录大小173MB

第二步骤升级第一个fe follower,端口没有启动成功需要看下日志检查下原因,不能继续操作,当时端口没启动起来的fe日志里输出的是什么,日志还有嘛

2024-05-31 22:41:02,876 INFO (Thread-43|87) [StarRocksFE.lambda$addShutdownHook$1():354] start to execute shutdown hook
2024-05-31 22:41:02,881 INFO (Thread-43|87) [StarRocksFE.lambda$addShutdownHook$1():379] shutdown hook end
2024-05-31 22:43:52,996 INFO (main|1) [StarRocksFE.start():110] StarRocks FE starting, version: 2.5.21-dc2bcdb
2024-05-31 22:43:53,001 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():286] configured prior_cidrs value: 172.16.16.29
2024-05-31 22:43:53,003 INFO (main|1) [FrontendOptions.initAddrUseIp():236] Use IP init local addr, IP: /172.16.16.29
2024-05-31 22:43:53,168 INFO (main|1) [Auth.grantRoleInternal():779] grant operator to ‘root’@’%’, isReplay = true
2024-05-31 22:43:53,323 INFO (main|1) [NodeMgr.getHelperNodes():602] get helper nodes: [172.16.16.29:9010]
2024-05-31 22:43:53,336 INFO (main|1) [NodeMgr.getClusterIdAndRoleOnStartup():402] finished to get cluster id: 511371839, role: FOLLOWER and node name: 172.16.16.29_9010_1691737065524
2024-05-31 22:43:53,337 INFO (main|1) [BDBEnvironment.ensureHelperInLocal():321] skip check local environment because helper node and local node are identical.
2024-05-31 22:43:53,357 INFO (main|1) [BDBEnvironment.setupEnvironment():251] start to setup bdb environment for 1 times
2024-05-31 22:43:54,357 WARN (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [StateChangeExecutor.notifyNewFETypeTransfer():45] notify new FE type transfer: UNKNOWN
2024-05-31 22:43:54,379 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [BDBEnvironment.setupEnvironment():262] replicated environment is all set, wait for state change…
2024-05-31 22:43:54,379 WARN (RepNode 172.16.16.29_9010_1691737065524(-1)|54) [StateChangeExecutor.notifyNewFETypeTransfer():45] notify new FE type transfer: FOLLOWER
2024-05-31 22:43:54,380 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [BDBEnvironment.setupEnvironment():270] state change done, current role FOLLOWER
2024-05-31 22:43:54,385 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [BDBEnvironment.setupEnvironment():274] end setup bdb environment after 1 times
2024-05-31 22:43:54,388 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadImage():1350] start load image from /data/starrocks/meta/image/image.18261977. is ckpt: false
2024-05-31 22:43:54,388 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadHeader():1498] finished replay header from image
2024-05-31 22:43:54,389 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [NodeMgr.loadLeaderInfo():1171] finished replay masterInfo from image
2024-05-31 22:43:55,452 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [LocalMetastore.loadDb():344] finished replay databases from image
2024-05-31 22:43:56,287 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [Load.loadLoadJob():986] finished replay loadJob from image
2024-05-31 22:43:56,287 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadAlterJob():1513] finished replay alterJob from image
2024-05-31 22:43:56,288 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [CatalogRecycleBin.loadRecycleBin():1185] finished replay recycleBin from image
2024-05-31 22:43:56,296 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [VariableMgr.loadGlobalVariable():554] finished replay globalVariable from image
2024-05-31 22:43:56,298 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [LocalMetastore.loadCluster():4533] finished replay cluster from image
2024-05-31 22:43:56,300 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [NodeMgr.loadBrokers():1139] finished replay brokerMgr from image
2024-05-31 22:43:56,303 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadResources():1600] finished replay resources from image
2024-05-31 22:43:56,303 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadResources():1602] start to replay resource mapping catalog
2024-05-31 22:43:56,303 INFO (UNKNOWN 172.16.16.29_9010_1691737065524(-1)|1) [GlobalStateMgr.loadResources():1604] finished replaying resource mapping catalogs from resources

当时没开debug模式,log里面信息比较少

“finished replay cluster from image” 这个日志是fe正常的启动日志,可能因为元数据比较大回放过程较久,您等待一段时间不输出这个日志fe就真正起来了,起来后fe的所有端口都在,您再继续操作下一台就可以了。

meta目录只有173MB, 集群整体数据量估计是20TB, 这个操作预估要多久呢?

6-15晚上重新进行升级操作,21:21开始重启一个fe follow节点, 直至第二天07:00 fe仍未正常启动,只有9010端口

6-15晚上重新进行升级操作,21:21开始重启一个fe follow节点, 直至第二天07:00 fe仍未正常启动,只有9010端口

请发下fe.log 中输出的日志信息

fe.log (4.6 KB)

2024-06-15 21:21:41,351 INFO (main|1) [StarRocksFE.start():110] StarRocks FE starting, version: 2.5.21-dc2bcdb
2024-06-15 21:21:41,357 INFO (main|1) [FrontendOptions.analyzePriorityCidrs():286] configured prior_cidrs value: 172.16.16.44
2024-06-15 21:21:41,358 INFO (main|1) [FrontendOptions.initAddrUseIp():236] Use IP init local addr, IP: /172.16.16.44
2024-06-15 21:21:41,530 INFO (main|1) [Auth.grantRoleInternal():779] grant operator to ‘root’@’%’, isReplay = true
2024-06-15 21:21:41,690 INFO (main|1) [NodeMgr.getHelperNodes():602] get helper nodes: [172.16.16.44:9010]
2024-06-15 21:21:41,702 INFO (main|1) [NodeMgr.getClusterIdAndRoleOnStartup():402] finished to get cluster id: 511371839, role: FOLLOWER and node name: 172.16.16.44_9010_1691736040326
2024-06-15 21:21:41,703 INFO (main|1) [BDBEnvironment.ensureHelperInLocal():321] skip check local environment because helper node and local node are identical.
2024-06-15 21:21:41,724 INFO (main|1) [BDBEnvironment.setupEnvironment():251] start to setup bdb environment for 1 times
2024-06-15 21:21:42,711 WARN (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [StateChangeExecutor.notifyNewFETypeTransfer():45] notify new FE type transfer: UNKNOWN
2024-06-15 21:21:42,730 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BDBEnvironment.setupEnvironment():262] replicated environment is all set, wait for state change…
2024-06-15 21:21:42,730 WARN (RepNode 172.16.16.44_9010_1691736040326(-1)|54) [StateChangeExecutor.notifyNewFETypeTransfer():45] notify new FE type transfer: FOLLOWER
2024-06-15 21:21:42,731 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BDBEnvironment.setupEnvironment():270] state change done, current role FOLLOWER
2024-06-15 21:21:42,737 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BDBEnvironment.setupEnvironment():274] end setup bdb environment after 1 times
2024-06-15 21:21:42,741 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadImage():1350] start load image from /data/starrocks/meta/image/image.19559585. is ckpt: false
2024-06-15 21:21:42,741 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadHeader():1498] finished replay header from image
2024-06-15 21:21:42,742 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [NodeMgr.loadLeaderInfo():1171] finished replay masterInfo from image
2024-06-15 21:21:43,943 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [LocalMetastore.loadDb():344] finished replay databases from image
2024-06-15 21:21:45,001 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [Load.loadLoadJob():986] finished replay loadJob from image
2024-06-15 21:21:45,002 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadAlterJob():1513] finished replay alterJob from image
2024-06-15 21:21:45,004 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [CatalogRecycleBin.loadRecycleBin():1185] finished replay recycleBin from image
2024-06-15 21:21:45,016 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [VariableMgr.loadGlobalVariable():554] finished replay globalVariable from image
2024-06-15 21:21:45,020 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [LocalMetastore.loadCluster():4533] finished replay cluster from image
2024-06-15 21:21:45,022 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [NodeMgr.loadBrokers():1139] finished replay brokerMgr from image
2024-06-15 21:21:45,026 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadResources():1600] finished replay resources from image
2024-06-15 21:21:45,026 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadResources():1602] start to replay resource mapping catalog
2024-06-15 21:21:45,027 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [GlobalStateMgr.loadResources():1604] finished replaying resource mapping catalogs from resources
2024-06-15 21:21:45,027 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [ExportMgr.loadExportJob():385] finished replay exportJob from image
2024-06-15 21:21:45,054 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BlobStorage.getBroker():909] get broker: data_trans[TNetworkAddress(hostname:172.16.16.219, port:8000)]
2024-06-15 21:23:45,072 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BlobStorage.getBroker():909] get broker: data_trans[TNetworkAddress(hostname:172.16.19.6, port:8000)]
2024-06-15 21:25:45,165 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BlobStorage.getBroker():909] get broker: data_trans[TNetworkAddress(hostname:172.16.16.55, port:8000)]
2024-06-15 21:27:45,265 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BlobStorage.getBroker():909] get broker: data_trans[TNetworkAddress(hostname:172.16.18.40, port:8000)]

后面的日志都是重复的 2024-06-15 21:27:45,265 INFO (UNKNOWN 172.16.16.44_9010_1691736040326(-1)|1) [BlobStorage.getBroker():909] get broker: data_trans[TNetworkAddress(hostname:172.16.18.40, port:8000)]