BE not alive

[Details]
One FE and one BE are co-deployed on a single machine.

  1. The FE is healthy. After logging in with the MySQL client on port 9030, SHOW FRONTENDS\G returns:
*************************** 1. row ***************************
             Name: 10.60.232.20_9010_1669260589173
               IP: 10.60.232.20
      EditLogPort: 9010
         HttpPort: 8030
        QueryPort: 9030
          RpcPort: 9020
             Role: FOLLOWER
         IsMaster: true
        ClusterId: 2034440467
             Join: true
            Alive: true
ReplayedJournalId: 20981
    LastHeartbeat: 2022-11-24 14:37:10
         IsHelper: true
           ErrMsg:
        StartTime: 2022-11-23 23:48:15
          Version: UNKNOWN-
1 row in set (0.01 sec)
  2. SHOW BACKENDS\G returns:
*************************** 1. row ***************************
            BackendId: 17335
              Cluster: default_cluster
                   IP: 10.60.232.20
        HeartbeatPort: 9050
               BePort: -1
             HttpPort: -1
             BrpcPort: -1
        LastStartTime: NULL
        LastHeartbeat: NULL
                Alive: false
 SystemDecommissioned: false
ClusterDecommissioned: false
            TabletNum: 0
     DataUsedCapacity: .000
        AvailCapacity: 1.000 B
        TotalCapacity: .000
              UsedPct: 0.00 %
       MaxDiskUsedPct: 0.00 %
               ErrMsg: java.net.ConnectException: Connection refused (Connection refused)
              Version: UNKNOWN-
               Status: {"lastSuccessReportTabletsTime":"N/A"}
    DataTotalCapacity: .000
          DataUsedPct: 0.00 %
             CpuCores: 0
1 row in set (0.00 sec)
  3. Re-registering the BE node from the FE did not resolve the problem:
ALTER SYSTEM DROP BACKEND "10.60.232.20:9050";
ALTER SYSTEM ADD BACKEND "10.60.232.20:9050";
  4. stdout when the BE node starts:
ERROR 1064 (HY000) at line 1: Same backend already exists
  5. be.conf:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# INFO, WARNING, ERROR, FATAL
sys_log_level = INFO

# ports for admin, web, heartbeat service
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060

# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.
priority_networks = 10.60.232.0/24

# data root path, separate by ';'
# you can specify the storage medium of each root path, HDD or SSD, seperate by ','
# eg:
# storage_root_path = /data1,medium:HDD;/data2,medium:SSD;/data3
# /data1, HDD;
# /data2, SSD;
# /data3, HDD(default);
#
# Default value is ${STARROCKS_HOME}/storage, you should create it by hand.
# storage_root_path = ${STARROCKS_HOME}/storage

# Advanced configurations
# sys_log_dir = ${STARROCKS_HOME}/log
# sys_log_roll_mode = SIZE-MB-300
# sys_log_roll_num = 10
# sys_log_verbose_modules = *
# log_buffer_level = -1

default_rowset_type = beta

# memory

# https://github.com/StarRocks/starrocks/pull/10252
tc_use_memory_min = 0
tc_free_memory_rate = 0

mem_limit = 24%
load_process_max_memory_limit_percent = 60

[Business impact] BE service unavailable
[StarRocks version] 2.2.5
[Cluster size] 1 FE + 1 BE (co-deployed)
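
For anyone hitting the same symptom, here is a minimal sketch (not an official StarRocks tool) that picks the dead backends out of SHOW BACKENDS\G text and then probes the heartbeat port directly. The helper names dead_backends and port_open are made up for illustration, and the parsing assumes the "Key: value" layout shown in the output above. A "Connection refused" ErrMsg usually means nothing is listening on the heartbeat port, so the probe is expected to fail while the BE process is down.

```shell
#!/bin/sh
# Hypothetical helper: read `SHOW BACKENDS\G` text on stdin and print
# "IP:HeartbeatPort ErrMsg" for each backend whose Alive field is false.
dead_backends() {
  awk '
    {
      pos = index($0, ":")            # split each line at the first colon
      if (pos == 0) next              # row separators carry no key/value
      key = substr($0, 1, pos - 1); gsub(/^[ \t]+/, "", key)
      val = substr($0, pos + 1);    gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    key == "IP"            { ip = val }
    key == "HeartbeatPort" { port = val }
    key == "Alive"         { alive = val }
    key == "ErrMsg" && alive == "false" { print ip ":" port " " val }
  '
}

# Hypothetical helper: succeed if a TCP connection to host $1, port $2 opens
# within 2 seconds. Assumes bash (for /dev/tcp) and coreutils timeout exist.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example usage (the root user and the addresses are assumptions; adjust them):
#   mysql -h 10.60.232.20 -P 9030 -u root -e 'SHOW BACKENDS\G' | dead_backends
#   port_open 10.60.232.20 9050 || echo "nothing is listening on 9050"
```

If the probe fails on the BE host itself, the problem is the BE process rather than the registration in the FE, so dropping and re-adding the backend is unlikely to help.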

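This problem shows up in two environments: one is Ubuntu 22.04, the other is StarRocks 2.4.x.

The usual symptom is either an outright error: undefined symbol: _dl_sym, version GLIBC_PRIVATE
or be.out simply goes silent: nothing is printed after "start from", and no other log files are produced.

Judging from the stack trace, starrocks_be hangs during brpc initialization right after launch and never reaches the main function.

The quickest fix for now is to switch the OS to Ubuntu 18.04 or to a stable release of CentOS / Rocky Linux.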

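@U_1669320035544_5676
Thanks for sharing! In our case, version 2.2.5 is deployed from a container image whose base image is centos7. Deployments work fine in most cases; this failure occurred only this once.

So you are suggesting that I pull all the log files under be/log and take a look?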

Could you provide the be.INFO and be.out logs?

Also, please check with ps -ef | grep starrocks_be whether the BE process is alive.

@shemplle
Only be.out exists under be/log; there are no other log files such as be.log or be.INFO.

be.out contains the following:

start time: Thu Nov 24 09:59:30 UTC 2022
start time: Thu Nov 24 10:01:17 UTC 2022
start time: Thu Nov 24 10:03:15 UTC 2022
start time: Thu Nov 24 10:05:27 UTC 2022
start time: Thu Nov 24 10:08:01 UTC 2022
start time: Thu Nov 24 10:11:17 UTC 2022
start time: Thu Nov 24 10:15:50 UTC 2022
start time: Thu Nov 24 10:22:47 UTC 2022
start time: Thu Nov 24 10:29:39 UTC 2022
start time: Thu Nov 24 10:36:33 UTC 2022
start time: Thu Nov 24 10:43:21 UTC 2022
start time: Thu Nov 24 10:50:16 UTC 2022
start time: Thu Nov 24 10:57:02 UTC 2022
start time: Thu Nov 24 11:03:51 UTC 2022
start time: Thu Nov 24 11:10:46 UTC 2022
start time: Thu Nov 24 11:17:36 UTC 2022
start time: Thu Nov 24 11:24:24 UTC 2022
start time: Thu Nov 24 11:31:21 UTC 2022
...
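
A be.out that contains nothing but start markers suggests the process is being relaunched repeatedly (presumably by a supervisor or a container restart policy) and exits before it logs anything. Below is a small sketch for quantifying that from the file. The names count_starts and count_other_output are illustrative, and the only assumption is that every launch writes a line beginning with "start time:", as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: number of launches recorded in the be.out file given as $1.
count_starts() {
  grep -c '^start time:' "$1" || true   # grep exits 1 on a zero count; ignore that
}

# Hypothetical helper: number of non-empty lines that are NOT start markers,
# i.e. any real output the process managed to write before exiting.
count_other_output() {
  grep -vc -e '^start time:' -e '^$' "$1" || true
}

# Example usage (the path is an assumption; point it at your own be/log):
#   echo "launches: $(count_starts be/log/be.out)"
#   echo "other output lines: $(count_other_output be/log/be.out)"
```

Many launches with zero other output lines point at a failure very early in startup, before logging is initialized.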

This is an Ubuntu compatibility issue; it is fixed in 2.4.1.

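@trueeyu
Thanks for sharing! The scenario in this post is a containerized BE: the BE image is based on centos7, and the host also runs centos7. So perhaps it is not related to the Ubuntu compatibility issue you mentioned?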

What is your setup? Does dmesg -T show any OOM messages? Does the machine support the AVX2 instruction set?
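
Both checks can be run directly on the machine where the BE actually runs. Here is a rough sketch, assuming a Linux host that lists CPU flags in /proc/cpuinfo and a kernel whose OOM messages use the usual wording. The helper names has_avx2 and oom_events are illustrative, not StarRocks tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the CPU flags list contains the whole word "avx2".
# Reads /proc/cpuinfo by default; pass another file as $1 to check saved output.
has_avx2() {
  grep -qw 'avx2' "${1:-/proc/cpuinfo}"
}

# Hypothetical helper: filter kernel log text on stdin down to OOM-killer lines.
# The patterns are the common kernel phrasings, not an exhaustive list.
oom_events() {
  grep -iE 'out of memory|oom-kill|killed process'
}

# Example usage (dmesg may need root on some systems):
#   has_avx2 && echo "AVX2 supported" || echo "AVX2 NOT supported"
#   dmesg -T | oom_events
```

Reading the flags from a file argument also makes it easy to check a cpuinfo dump copied from another machine.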

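The original poster's problem is the Ubuntu compatibility issue; yours needs a closer look. Add me on WeChat and we can work through it together?

@trueeyu Ha, I am the original poster. The co-deployment here means that the FE and the BE run as two separate services, both from centos7-based images, on a centos7 host.

I will follow your suggestion and check again whether the host supports the AVX2 instruction set. Two quick questions:

  1. When you say to check for OOM logs, do you mean for the BE only, or for both the BE and the FE?
  2. On a host without AVX2 support, shouldn't both the FE and the BE fail to start, rather than what I described, where the FE starts but the BE is abnormal?

Without AVX2 support, the FE starts successfully and the BE fails to start.
Is the BE still reporting "Same backend already exists" on startup?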

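[Root cause]

User deployment error: StarRocks was deployed on a machine that does not support the AVX2 instruction set. It took so long to track down because the instruction-set check had been run on the wrong machine, one that supports AVX2 but is not the machine StarRocks was deployed on. An embarrassing mix-up.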
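
To avoid the same mix-up, it may help to confirm that the shell you are in really belongs to the machine the BE is registered with before trusting any CPU check. Here is a minimal sketch, assuming the BE IP is copied from SHOW BACKENDS and that hostname -I is available to list local addresses. The helper name is_local_ip is illustrative. With bridged container networking the registered IP may not appear locally, so treat a mismatch as a prompt to double-check rather than as proof.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the IP given as $1 appears among the
# remaining arguments (typically the output of `hostname -I`).
is_local_ip() {
  want_ip=$1
  shift
  for local_ip in "$@"; do
    [ "$local_ip" = "$want_ip" ] && return 0
  done
  return 1
}

# Example usage (10.60.232.20 is the BE IP from the SHOW BACKENDS output):
#   if is_local_ip 10.60.232.20 $(hostname -I); then
#     grep -qw avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 NOT supported"
#   else
#     echo "this is not the BE host; the CPU check here proves nothing"
#   fi
```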

Thanks again to @U_1669320035544_5676 @shemplle @trueeyu for the suggestions and help.