StarRocks on K8s: FE fails to come up after redeploying

【Details】
StarRocks deployed on K8s:
three FEs, three BEs,
all mounted on shared storage,
with each FE's meta folder separated by pod name.
Screenshot of the meta directories on the shared storage (attachment not reproduced here):
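In place of the screenshot, the layout is roughly the following (a hypothetical sketch based on the pod names; the actual root path depends on how the volume is mounted):

/shared-storage/                      # shared volume root (path assumed)
├── my-starrockscluster-fe-0/meta/
├── my-starrockscluster-fe-1/meta/
└── my-starrockscluster-fe-2/meta/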


【StarRocks Version】3.0.2
Symptoms in detail:
After the first deployment, the cluster starts normally. Pod my-starrockscluster-fe-0 is the leader.
Then I delete the whole cluster (in my case: kubectl -n starrocks delete -f fe_v5.yaml)
and recreate it (in my case: kubectl -n starrocks create -f fe_v5.yaml).
Since the meta files all live on shared storage, the whole cluster should in theory recover correctly.
But it does not.
The logs of my-starrockscluster-fe-0 look like this (excerpt further below):

What the logs show:
During recovery, my-starrockscluster-fe-0 becomes a FOLLOWER and then waits forever for globalStateMgr to be ready, printing that same line over and over.
I wonder whether this is related to how the FEs elect a leader in the K8s environment.
Please help. I really don't understand why the FE cannot come up when recovering from its existing data. What could be causing this?

Supplement:
I set the FE replicas to 1 and checked the ROLE file of my-starrockscluster-fe-0: surprisingly it says FOLLOWER, even though the cluster is healthy and my-starrockscluster-fe-0 is in fact the leader. The ROLE file never changes to LEADER, no matter how long I wait.
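As far as I know, the ROLE file under the meta directory records the node's electable type (FOLLOWER vs. OBSERVER), not its runtime leadership, so a node that currently is the leader can still show FOLLOWER there. The runtime role is better checked with SHOW FRONTENDS (a sketch; the service hostname is assumed):

# The Role column of SHOW FRONTENDS reports LEADER for the current leader.
mysql --connect-timeout 2 -h my-starrockscluster-fe-service.starrocks -P 9030 -u root \
  -e 'SHOW FRONTENDS\G'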

Supplement:
The meta files of my-starrockscluster-fe-0 are packaged here:
meta.tar.gz (208.6 KB)

Supplement:
The FE deployment YAML is attached below; just swap the image for the official one.
fe_v5.yaml (7.2 KB)
It uses hostPath, and a separate meta folder is created per pod name.

What could be causing globalStateMgr to never become ready?

2023-07-04 14:23:01,258 INFO (stateChangeExecutor|67) [StateChangeExecutor.runOneCycle():85] begin to transfer FE type from INIT to UNKNOWN
2023-07-04 14:23:01,258 INFO (stateChangeExecutor|67) [StateChangeExecutor.runOneCycle():179] finished to transfer FE type from INIT to UNKNOWN
2023-07-04 14:23:03,258 INFO (UNKNOWN my-starrockscluster-fe-0.my-starrockscluster-fe-search.starrocks.svc.cluster.local_9010_1688119487623(-1)|1) [GlobalStateMgr.waitForReady():1138] wait globalStateMgr to be ready. FE type: INIT. is ready: false

Please post the complete logs. You could first check whether the FE is being started as the same user as before. If you redeploy once more, does it come up?
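For the user check suggested above, something like this should do (a sketch; pod name and meta path are taken from this thread):

# Compare the user the container runs as with the owner of the meta files.
# A UID mismatch after a redeploy (e.g. root vs. a non-root fsGroup) can
# leave the FE unable to read its own journal.
kubectl -n starrocks exec my-starrockscluster-fe-0 -- id
kubectl -n starrocks exec my-starrockscluster-fe-0 -- ls -ln /opt/starrocks/fe/meta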

The problem I ran into may be even stranger: with three FE nodes, every one of them records itself as a Follower.

Reproduction steps

# Deploy a three-node StarRocks cluster.
kubectl apply -f deploy.yaml;

# Tear down the whole cluster environment
kubectl delete -f deploy.yaml;

# Redeploy the cluster
kubectl apply -f deploy.yaml

What is special about deploy.yaml is that the meta is persisted. Below is the complete custom resource (CR) definition.

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  labels:
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Helm
    cluster: kube-starrocks
  name: kube-starrocks
  namespace: starrocks
spec:
  starRocksBeSpec:
    annotations:
      app.starrocks.io/be-config-hash: 951886d3
    configMapInfo:
      configMapName: kube-starrocks-be-cm
      resolveKey: be.conf
    configMaps:
    - mountPath: /etc/my-configmap
      name: my-configmap
    fsGroup: 0
    image: starrocks/be-ubuntu:2.5.4
    limits:
      cpu: 1
      memory: 2Gi
    replicas: 1
    requests:
      cpu: 1
      memory: 2Gi
    service:
      type: ClusterIP
    storageVolumes:
    - mountPath: /var/core-dir
      name: core
      storageSize: 10Gi
  starRocksFeSpec:
    annotations:
      app.starrocks.io/fe-config-hash: 7c889e00
    configMapInfo:
      configMapName: kube-starrocks-fe-cm
      resolveKey: fe.conf
    configMaps:
    - mountPath: /etc/my-configmap
      name: my-configmap
    fsGroup: 0
    image: starrocks/fe-ubuntu:2.5.4
    limits:
      cpu: 8
      memory: 8Gi
    replicas: 3
    requests:
      cpu: 100m
      memory: 300Mi
    service:
      type: ClusterIP
    storageVolumes:
    - mountPath: /opt/starrocks/fe/meta
      name: meta
      storageSize: 2Gi

Troubleshooting

Reading the fe_entrypoint.sh script shows that the leader is probed in one of two ways: pod 0 runs probe_leader_for_pod0, and every other pod runs probe_leader_for_podX.

Startup chain: /opt/starrocks/fe_entrypoint.sh --> /opt/starrocks/fe/bin/start_fe.sh --> /lib/jvm/default-java/bin/java

update_conf_from_configmap
collect_env_info
probe_leader $svc_name		# probe_leader runs first, then /opt/starrocks/fe/bin/start_fe.sh starts the FE
start_fe $svc_name

probe_leader()
{
    local svc=$1
    # find leader under current service and set to FE_LEADER
    if [[ "$POD_INDEX" -eq 0 ]] ; then
        probe_leader_for_pod0 $svc
    else
        probe_leader_for_podX $svc
    fi
}

probe_leader_for_pod0()
{
    # possible to have no result at all, because myself is the first FE instance in the cluster
    local svc=$1
    local start=`date +%s`
    local has_member=false
    local memlist=
    while true
    do
        memlist=`show_frontends $svc`
        local leader=`echo "$memlist" | grep '\<LEADER\>' | awk '{print $2}'`
        if [[ "x$leader" != "x" ]] ; then
            # has leader, done
            log_stderr "Find leader: $leader!"
            FE_LEADER=$leader
            return 0
        fi

        if [[ "x$memlist" != "x" ]] ; then
            # has memberlist ever before
            has_member=true
        fi

        # no leader yet, check if needs timeout and quit
        log_stderr "No leader yet, has_member: $has_member ..."
        local timeout=$PROBE_LEADER_POD0_TIMEOUT
        if $has_member ; then
            # set timeout to the same as PODX since there are other members
            timeout=$PROBE_LEADER_PODX_TIMEOUT
        fi

        local now=`date +%s`
        let "expire=start+timeout"
        if [[ $expire -le $now ]] ; then
            if $has_member ; then
                log_stderr "Timed out, abort!"
                exit 1
            else
                log_stderr "Timed out, no members detected ever, assume myself is the first node .."
                # empty FE_LEADER
                FE_LEADER=""
                return 0
            fi
        fi
        sleep $PROBE_INTERVAL
    done
}

probe_leader_for_podX()
{
    # wait until find a leader or timeout
    local svc=$1
    local start=`date +%s`
    while true
    do
        local leader=`show_frontends $svc | grep '\<LEADER\>' | awk '{print $2}'`
        if [[ "x$leader" != "x" ]] ; then
            # has leader, done
            log_stderr "Find leader: $leader!"
            FE_LEADER=$leader
            return 0
        fi
        # no leader yet, check if needs timeout and quit
        log_stderr "No leader yet ..."

        local now=`date +%s`
        let "expire=start+PROBE_LEADER_PODX_TIMEOUT"
        if [[ $expire -le $now ]] ; then
            log_stderr "Timed out, abort!"
            exit 1
        fi

        sleep $PROBE_INTERVAL
    done
}

show_frontends()
{
    local svc=$1
    # ensure `mysql` command can be ended with 15 seconds
    # "show frontends" query will hang when there is no leader yet in the cluster
    timeout 15 mysql --connect-timeout 2 -h $svc -P $QUERY_PORT -u root --skip-column-names --batch -e 'show frontends;'
}

To summarize:

  1. Pod 0 (probe_leader_for_pod0): if it never sees any other member, it assumes it is the first node in the cluster and still calls start_fe.sh; but if it has seen members and no leader appears before the timeout, it too aborts (exit 1).
  2. Any other pod (probe_leader_for_podX): if no leader is found before the timeout, it exits immediately (exit 1) without ever starting the FE.

And there is the problem: electing a new leader requires at least two FE nodes, but after a full teardown pods 1 and 2 exit before ever starting their FE because no leader exists yet, and pod 0 alone cannot form a majority of the three-member group. The FEs end up stuck in an unavailable state.
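One way around this deadlock (a sketch only, not an official fix) is to let a non-0 pod fall through and start the FE anyway when it already has metadata from a previous run, so that the surviving members can re-elect a leader after a full outage. The meta-presence check and its path are my own assumption:

probe_leader_for_podX()
{
    local svc=$1
    local start=`date +%s`
    while true
    do
        local leader=`show_frontends $svc | grep '\<LEADER\>' | awk '{print $2}'`
        if [[ "x$leader" != "x" ]] ; then
            log_stderr "Find leader: $leader!"
            FE_LEADER=$leader
            return 0
        fi
        log_stderr "No leader yet ..."
        local now=`date +%s`
        let "expire=start+PROBE_LEADER_PODX_TIMEOUT"
        if [[ $expire -le $now ]] ; then
            # changed part: a pod that already has local meta can start without
            # a helper and take part in the election instead of aborting
            if [[ -f /opt/starrocks/fe/meta/image/ROLE ]] ; then
                log_stderr "Timed out but local meta exists, starting anyway .."
                FE_LEADER=""
                return 0
            fi
            log_stderr "Timed out, abort!"
            exit 1
        fi
        sleep $PROBE_INTERVAL
    done
}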

Temporary workaround:

Edit the StatefulSet directly:

# 1. Change the startup command.
        command:
        - /usr/bin/sleep
        args:
        - infinity

# 2. kubectl exec into each pod and run the commands below.
# pod0:   /opt/starrocks/fe_entrypoint.sh kube-starrocks-fe-service.starrocks &
#         Note: kube-starrocks is the name of the StarRocksCluster CR, starrocks is its namespace.
# others: /opt/starrocks/fe/bin/start_fe.sh &
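Step 1 can be applied with a one-off patch like the one below (a sketch; the StatefulSet name is assumed from the CR above, and the operator may reconcile the StatefulSet and revert this change, so you may need to scale the operator down first):

# Replace the FE container entrypoint with sleep so the pods stay up
# without starting the FE; then exec in and run the commands above.
kubectl -n starrocks patch statefulset kube-starrocks-fe --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/usr/bin/sleep"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["infinity"]}
]'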

The startup command in my StatefulSet definition is the same as yours:

pod0: /opt/starrocks/fe_entrypoint.sh kube-starrocks-fe-service.starrocks

A StatefulSet you create yourself needs podManagementPolicy: Parallel to allow all pods to start at the same time; only then can the cluster recover from a completely-down state.
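For reference, the field sits at the top level of the StatefulSet spec (a minimal sketch; note that podManagementPolicy is immutable after creation, so an existing StatefulSet has to be deleted and recreated to change it):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-starrocks-fe        # name assumed; use your own StatefulSet's name
  namespace: starrocks
spec:
  podManagementPolicy: Parallel  # default is OrderedReady, which keeps pod 1/2 from starting until pod 0 is Ready
  replicas: 3
  # ... serviceName, selector, template, etc. unchanged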