Shared-data cluster: cannot add new partitions, cannot create tables

Also, this backtrace seems to show only one thread in starrocks::StarOSWorker::get_shard_filesystem. When the core was captured, did the cluster appear to be hung and timing out?

When I took the gcore, the pod's node and IP did not change, but the CN process shows as restarted. The screenshot is from K9s; the value circled in red goes up by 1.

The gcore file is over 20 GB. I'll export a fresh one on Monday and send it to you via a cloud drive.

While gcore is running, the process is unresponsive. The pod health check fails, so kubelet kills the pod and starts it again.

I exported a fresh core file, over 20 GB.
Baidu Netdisk link, file shared via Netdisk: starrocks_be_core.27
Link: https://pan.baidu.com/s/1F6TfxLW-yJiMDBbMvs4l1A?pwd=av8k Extraction code: av8k

Which version is the CN?

This core was dumped from 3.5.14.

Thread 532 (Thread 0x7f978de36640 (LWP 755)):
#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x7f98ae3e222c) at ./nptl/futex-internal.c:103
#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7f98ae3e222c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:128
#2  0x00007f98b369224f in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7f98ae3e2220) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x7f98ae3e2220) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x000000000bf2ee29 in std::__glibcxx_rwlock_wrlock (__rwlock=0x7f98ae3e2220) at /usr/include/c++/11/shared_mutex:80
#5  std::__shared_mutex_pthread::lock (this=0x7f98ae3e2220) at /usr/include/c++/11/shared_mutex:193
#6  std::shared_mutex::lock (this=0x7f98ae3e2220) at /usr/include/c++/11/shared_mutex:420
#7  std::unique_lock<std::shared_mutex>::lock (this=<synthetic pointer>) at /usr/include/c++/11/bits/unique_lock.h:139
#8  std::unique_lock<std::shared_mutex>::unique_lock (__m=..., this=<synthetic pointer>) at /usr/include/c++/11/bits/unique_lock.h:69
#9  starrocks::StarOSWorker::new_shared_filesystem (this=this@entry=0x7f98ae3e2190, scheme=..., conf=...) at be/src/service/staros_worker.cpp:364
#10 0x000000000bf30d34 in starrocks::StarOSWorker::build_filesystem_from_shard_info (this=this@entry=0x7f98ae3e2190, info=..., conf=...) at /usr/include/c++/11/string_view:137
#11 0x000000000bf323de in starrocks::StarOSWorker::get_shard_filesystem (this=0x7f98ae3e2190, id=83521, conf=...) at be/src/service/staros_worker.cpp:239
#12 0x00000000081ead1c in starrocks::StarletFileSystem::get_shard_filesystem (shard_id=<optimized out>, this=0x7f97b26cf000) at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#13 starrocks::StarletFileSystem::delete_dir (this=0x7f97b26cf000, dirname=...) at be/src/fs/fs_starlet.cpp:467
#14 0x000000000a7deb31 in starrocks::lake::LoadSpillBlockManager::clear_parent_path (this=this@entry=0x7f96e53f68e0) at be/src/storage/lake/load_spill_block_manager.cpp:103
#15 0x000000000a7df033 in starrocks::lake::LoadSpillBlockManager::~LoadSpillBlockManager (this=this@entry=0x7f96e53f68e0, __in_chrg=<optimized out>) at be/src/storage/lake/load_spill_block_manager.cpp:90

mutex lock: this=0x7f98ae3e2220

(gdb) p /x *this

$6 = {_M_rwlock = {__data = {__readers = 0xfffffffb, __writers = 0x0, __wrphase_futex = 0x1, __writers_futex = 0x3, __pad3 = 0x0, __pad4 = 0x0, __cur_writer = 0x2fa, __shared = 0x0, __rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __pad2 = 0x0, __flags = 0x0}, __size = {0xfb, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0x3, 0x0 <repeats 11 times>, 0xfa, 0x2, 0x0 <repeats 30 times>}, __align = 0xfffffffb}}
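The lock is currently held exclusively as a write lock (not shared by readers).
The write-lock holder is the thread with TID = 762 (__cur_writer = 0x2fa).
__readers = 0xfffffffb is -5 as a signed 32-bit value. A pthread rwlock is not re-entrant, so this looks like an underflowed counter rather than a re-entry count.
__writers_futex = 0x3, and Thread 532 above is blocked on exactly this futex word with expected=3, so at least one thread is waiting for the write lock.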
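
One way to sanity-check a reading of this dump is to decode the raw words by hand. The sketch below assumes the bit layout used by glibc's nptl rwlock implementation (WRPHASE = 1, WRLOCKED = 2, RWAITING = 4, reader count in the bits above those, FUTEX_USED = 2 in __writers_futex). These constants come from my reading of glibc and should be checked against the glibc version in the CN image. The helper name decode_rwlock is made up for illustration.

```python
# Assumed constants, taken from glibc's nptl/pthread_rwlock_common.c.
# Verify them against the glibc version shipped in the CN image.
PTHREAD_RWLOCK_WRPHASE = 1
PTHREAD_RWLOCK_WRLOCKED = 2
PTHREAD_RWLOCK_RWAITING = 4
PTHREAD_RWLOCK_READER_SHIFT = 3
PTHREAD_RWLOCK_FUTEX_USED = 2


def decode_rwlock(readers: int, writers_futex: int, cur_writer: int) -> dict:
    """Decode the fields printed by `p /x *this` (hypothetical helper)."""
    return {
        "write_phase": bool(readers & PTHREAD_RWLOCK_WRPHASE),
        "write_locked": bool(readers & PTHREAD_RWLOCK_WRLOCKED),
        "readers_waiting": bool(readers & PTHREAD_RWLOCK_RWAITING),
        "reader_count": readers >> PTHREAD_RWLOCK_READER_SHIFT,
        "writers_waiting_on_futex": bool(writers_futex & PTHREAD_RWLOCK_FUTEX_USED),
        "writer_tid": cur_writer,
    }


# Values copied from the gdb output above.
state = decode_rwlock(readers=0xFFFFFFFB, writers_futex=0x3, cur_writer=0x2FA)
for key, value in state.items():
    print(f"{key}: {value}")
```

Under these assumptions the lock is write-locked by TID 762, a waiter is registered on the writers futex, and the reader count comes out as 0x1fffffff, which no real workload would reach and which looks like a counter that wrapped below zero.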
(gdb) t 539
[Switching to thread 539 (Thread 0x7f979133d640 (LWP 762))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38	../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000000010266dda in bthread::futex_wait_private (timeout=0x0, expected=<optimized out>, addr1=<optimized out>) at ./src/bthread/sys_futex.h:40
#2  bthread::ParkingLot::wait (expected_state=..., this=<optimized out>) at ./src/bthread/parking_lot.h:60
#3  bthread::TaskGroup::wait_task (this=this@entry=0x7f98a8b44080, tid=tid@entry=0x7f979132d6f8) at src/bthread/task_group.cpp:133
#4  0x0000000010269d6b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f98a8b44080) at src/bthread/task_group.cpp:161
#5  0x0000000010263dc2 in bthread::TaskControl::worker_thread (arg=<optimized out>) at src/bthread/task_control.cpp:99
#6  0x00007f98b368bac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f98b371d8d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

TID 762 is a bthread worker thread. Most likely a bthread took the lock and was then switched to a different worker thread, so its unlock() had no effect.
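
This hypothesis can be checked against the __readers value in the dump with a little arithmetic. The sketch below is a toy model, not glibc code. It assumes that pthread_rwlock_unlock chooses between a writer unlock and a reader unlock by comparing __cur_writer with the calling thread's TID, and that a reader unlock subtracts 1 << 3 from __readers. TID 999 is a made-up value standing in for whichever worker the bthread resumed on.

```python
# Toy model of the suspected failure, NOT glibc code. Assumptions:
#  - unlock() picks writer vs. reader unlock by comparing __cur_writer
#    with the caller's TID
#  - a reader unlock subtracts (1 << 3) from the 32-bit __readers word
WRPHASE, WRLOCKED, READER_SHIFT = 1, 2, 3
MASK32 = 0xFFFFFFFF


class ToyRwlock:
    def __init__(self) -> None:
        self.readers = 0
        self.cur_writer = 0

    def write_lock(self, tid: int) -> None:
        # Uncontended writer: set the write-locked and write-phase bits.
        self.readers |= WRLOCKED | WRPHASE
        self.cur_writer = tid

    def unlock(self, tid: int) -> str:
        if self.cur_writer == tid:
            self.readers &= ~(WRLOCKED | WRPHASE) & MASK32
            self.cur_writer = 0
            return "writer unlock"
        # Caller is not the recorded writer, so it is treated as a reader.
        self.readers = (self.readers - (1 << READER_SHIFT)) & MASK32
        return "reader unlock"


lock = ToyRwlock()
lock.write_lock(tid=762)     # bthread takes the lock while running on LWP 762
path = lock.unlock(tid=999)  # and unlocks after moving to another worker
print(path, hex(lock.readers), lock.cur_writer)
```

In this model the lock stays write-locked with __cur_writer = 762 and __readers = 0xfffffffb, the same values as in the dump, so any later writer such as Thread 532 would wait forever.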

Try to reproduce it so that the process crashes at the point of the first failure:

export PTHREAD_MUTEX_ERRORCHECK=1

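Inject this environment variable into the pod and try to reproduce again. The process will crash when a mutex is released from a different thread than the one that locked it.

That may give us more information to troubleshoot the problem.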

Also, is this reliably reproducible? If you can work out minimal reproduction steps, I'll try to reproduce it too.

Minimal reproduction steps:

[Cluster version] Fresh deployment of 3.5.14
[Cluster size] 3 FEs (node: 8 cores, 20 GB memory, 50 GB disk)
1 CN (node: 16 cores, 64 GB memory, 100 GB disk)
[Hadoop version] Hadoop 3.1.1, deployed with Apache Ambari 2.7.5.0

ConfigMap for the CN:

# cn config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrocks-cn-cm
  namespace: bd-starrocks
  labels:
    cluster: starrocks
data:
  cn.conf: |
    JAVA_OPTS="--add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
    storage_root_path = /opt/starrocks/cn/storage/root
    spill_local_storage_dir = /opt/starrocks/cn/storage/spill
    datacache_enable = false
    datacache_mem_size = 10%
    datacache_disk_size = 107374182400
    
    mem_limit = 90%
    report_task_interval_seconds = 10
    starlet_star_cache_disk_size_percent = 20

ConfigMap for the FE:

# fe config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrocks-fe-cm
  namespace: bd-starrocks
  labels:
    cluster: starrocks
data:
  fe.conf: |
    LOG_DIR = /opt/starrocks/fe/log
    DATE = "$(date +%Y%m%d-%H%M%S)"
    JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xms8192m -Xmx8192m -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE -XX:ErrorFile=${LOG_DIR}/hs_err_pid%p.log -Djava.security.policy=${STARROCKS_HOME}/conf/udf_security.policy"

    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    sys_log_level = INFO
    mysql_service_nio_enabled = true
    tablet_create_timeout_second = 60

    fast_schema_evolution = true

    # config for shared-data mode
    run_mode = shared_data
    cloud_native_meta_port = 6090
    cloud_native_storage_type = S3
    enable_load_volume_from_conf = false

    enable_udf = true
    max_automatic_partition_number = 87600
    enable_statistic_collect = false
    enable_collect_full_statistic = false

ConfigMap for Hadoop:

apiVersion: v1
kind: ConfigMap
metadata:
  name: starrocks-hdfs-cm
  namespace: bd-starrocks
  labels:
    cluster: starrocks
data:
  hadoop_env.sh: |
    export HADOOP_USER_NAME="starrocks"
    export HADOOP_CLASSPATH=${STARROCKS_HOME}/lib/hadoop/common/*:${STARROCKS_HOME}/lib/hadoop/common/lib/*:${STARROCKS_HOME}/lib/hadoop/hdfs/*:${STARROCKS_HOME}/lib/hadoop/hdfs/lib/*
    
    if [ -z "${HADOOP_USER_NAME}" ]
    then
        if [ -z "${USER}" ]
        then
            export HADOOP_USER_NAME=$(id -u -n)
        else
            export HADOOP_USER_NAME=${USER}
        fi
    fi
    
    if [ ${HADOOP_CONF_DIR}"X" != "X" ]; then
        export HADOOP_CLASSPATH=${HADOOP_CONF_DIR}:${HADOOP_CLASSPATH}
    fi

  hdfs-site.xml: |
    <configuration  xmlns:xi="http://www.w3.org/2001/XInclude">
        <property>
            <name>dfs.nameservices</name>
            <value>ljx</value>
        </property>
        <property>
            <name>dfs.ha.namenodes.ljx</name>
            <value>nn1,nn2</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.ljx.nn1</name>
            <value>ljx-bd-c1-nn01.ljximing.int:8020</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.ljx.nn2</name>
            <value>ljx-bd-c1-nn02.ljximing.int:8020</value>
        </property>
        <property>
            <name>dfs.client.failover.proxy.provider.ljx</name>
            <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        </property>
        <property>
            <name>dfs.ha.automatic-failover.enabled</name>
            <value>true</value>
        </property>
    </configuration>

StarRocksCluster manifest for StarRocks:

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrocks
  namespace: bd-starrocks
spec:
  starRocksFeSpec:
    image: harbor.yowin.mobi/bd/starrocks-fe-ubuntu:3.5.14
    podLabels:
      app: starrocks-fe
      cluster: starrocks
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - starrocks-fe
            topologyKey: kubernetes.io/hostname
    feEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: sr-credential
            key: password
    replicas: 3
    requests:
      cpu: 7
      memory: 16Gi
    limits:
      cpu: 7
      memory: 18Gi
    storageVolumes:
      - name: starrocks-fe-storage
        storageClassName: oci-bv
        storageSize: 100Gi
        mountPath: /opt/starrocks/fe/meta
    service:
      type: LoadBalancer
    configMapInfo:
      configMapName: starrocks-fe-cm
      resolveKey: fe.conf
    configMaps:
      - name: starrocks-hdfs-cm
        mountPath: /opt/starrocks/fe/conf/hdfs-site.xml
        subPath: "hdfs-site.xml"
      - name: starrocks-hdfs-cm
        mountPath: /opt/starrocks/fe/conf/hadoop_env.sh
        subPath: "hadoop_env.sh"

  starRocksCnSpec:
    image: harbor.yowin.mobi/bd/starrocks-cn-ubuntu:3.5.14
    podLabels:
      app: starrocks-cn
      cluster: starrocks
    cnEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: sr-credential
            key: password
    requests:
      cpu: 14
      memory: 57Gi
    replicas: 1
    configMapInfo:
      configMapName: starrocks-cn-cm
      resolveKey: cn.conf
    configMaps:
      - name: starrocks-hdfs-cm
        mountPath: /opt/starrocks/cn/conf/hdfs-site.xml
        subPath: "hdfs-site.xml"
      - name: starrocks-hdfs-cm
        mountPath: /opt/starrocks/cn/conf/hadoop_env.sh
        subPath: "hadoop_env.sh"
    storageVolumes:
      - name: starrocks-cn-storage
        storageClassName: oci-bv
        storageSize: 100Gi
        mountPath: /opt/starrocks/cn/storage
    persistentVolumeClaimRetentionPolicy:
      whenDeleted: Retain
      whenScaled: Delete

1. Initialize the cluster root account

2. Create volume_hdfs

CREATE STORAGE VOLUME volume_hdfs
TYPE = HDFS
LOCATIONS = ("hdfs://ljx/apps/starrocks/volume-test/")
PROPERTIES (
       "username" = "starrocks",
       "hadoop.security.authentication" = "simple"
);

SET volume_hdfs AS DEFAULT STORAGE VOLUME;

3. Create the database and table

CREATE DATABASE dmp;
CREATE TABLE dmp.adt_ip_country_source (
  `__dt` datetime NOT NULL COMMENT "Data time, accurate to the hour",
  `ip` varchar(50) NOT NULL COMMENT "ip",
  `country` varchar(3) NOT NULL COMMENT "Country",
  `source` varchar(16) NOT NULL COMMENT "IP source"
) ENGINE=OLAP
COMMENT "IP country table, used for IP activity statistics"
PARTITION BY date_trunc('hour', __dt)
DISTRIBUTED BY HASH(`ip`, `country`) BUCKETS 1
PROPERTIES (
"compression" = "LZ4",
"datacache.enable" = "true",
"datacache.partition_duration" = "1 days",
"enable_async_write_back" = "false",
"partition_live_number" = "840",
"replication_num" = "1",
"storage_volume" = "volume_hdfs"
);

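4. Running the following statements reproduces the problem quickly. Once it occurs, every subsequent execution hangs unless the CN and FE are restarted.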

EXPLAIN ANALYZE INSERT INTO adt_ip_country_source (__dt,ip,country,source) SELECT DATE_ADD('2026-02-01 00:00:00', INTERVAL d hour),'','','' FROM table(generate_series(0, 23)) AS g(d);
EXPLAIN ANALYZE INSERT INTO adt_ip_country_source (__dt,ip,country,source) SELECT DATE_ADD('2026-02-02 00:00:00', INTERVAL d hour),'','','' FROM table(generate_series(0, 23)) AS g(d);
#... I won't paste every statement. Change the date in each one, one statement per day, then run them one after another in a mysql client. It triggers quickly, usually hanging after about 10 statements.
#... A plain INSERT INTO that writes data into a new partition also triggers it, but less often.
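
Since the full list of statements is not pasted, here is a minimal sketch of how it could be generated. It fills the start date into the statement from step 4 (written with the standard EXPLAIN ANALYZE keyword), one statement per day. The function name build_statements and the choice of 15 days are assumptions for illustration.

```python
from datetime import date, timedelta

# Template copied from the statements in step 4; only the start date varies.
TEMPLATE = (
    "EXPLAIN ANALYZE INSERT INTO adt_ip_country_source (__dt,ip,country,source) "
    "SELECT DATE_ADD('{day} 00:00:00', INTERVAL d hour),'','','' "
    "FROM table(generate_series(0, 23)) AS g(d);"
)


def build_statements(start: date, days: int) -> list[str]:
    """Return one statement per day, each touching 24 hourly partitions."""
    return [
        TEMPLATE.format(day=(start + timedelta(days=i)).isoformat())
        for i in range(days)
    ]


statements = build_statements(date(2026, 2, 1), days=15)
print("\n".join(statements))
```

The output can be saved to a file and fed to a mysql client connected to the dmp database. 15 days is an arbitrary choice that goes a little past the roughly 10 statements reported to trigger the hang.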





I'll give it a try.

I was sick for a while :joy: :joy:
With this added, I ran the test.
On the first execution of the SQL
EXPLAIN ANALYZE INSERT INTO adt_ip_country_source (__dt,ip,country,source) SELECT DATE_ADD('2026-02-01 00:00:00', INTERVAL d hour),'','','' FROM table(generate_series(0, 23)) AS g(d);

the CN process did crash right away. Here is the log from just before the crash:


I20260413 07:20:24.693708 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d3b5d76-3709-11f1-a4e9-1e2de80dab7d, txn_id: 266461, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
I20260413 07:20:24.698909 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d3b5d76-3709-11f1-a4e9-1e2de80dab7d, txn_id: 266461, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
I20260413 07:20:24.703985 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d3b5d76-3709-11f1-a4e9-1e2de80dab7d, txn_id: 266461, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
I20260413 07:20:24.709102 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d3b5d76-3709-11f1-a4e9-1e2de80dab7d, txn_id: 266461, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
I20260413 07:20:24.738090 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d42d7de-3709-11f1-aecf-bedc8990b519, txn_id: 266462, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
I20260413 07:20:24.764576 140520190547520 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: 3d476bc1-3709-11f1-aecf-bedc8990b519, txn_id: 266463, add chunk time(ms)/wait lock time(ms)/num: {296476:(0)(0)(1)}
W20260413 07:20:28.208846 140517568968256 stack_util.cpp:437] 2026-04-13 07:20:28.208818, query_id=00000000-0000-0000-0000-000000000000, fragment_instance_id=00000000-0000-0000-0000-000000000000 throws exception: std::system_error, trace:
     @          0xc106faf  __wrap___cxa_throw
    @         0x140619b8  std::__throw_system_error(int)
    @          0xbf2f2da  starrocks::StarOSWorker::new_shared_filesystem(std::basic_string_view<char, std::char_traits<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<charB^B
    @          0xbf30d34  starrocks::StarOSWorker::build_filesystem_from_shard_info(staros::starlet::ShardInfo const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::aB^B
    @          0xbf323de  starrocks::StarOSWorker::get_shard_filesystem(unsigned long, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std:B^B
    @          0x81ead1c  starrocks::StarletFileSystem::delete_dir(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    @          0xa7deb31  starrocks::lake::LoadSpillBlockManager::clear_parent_path()
    @          0xa7df033  starrocks::lake::LoadSpillBlockManager::~LoadSpillBlockManager()
    @          0xbcaa96f  starrocks::lake::DeltaWriter::~DeltaWriter()
    @          0xbef9cab  starrocks::lake::AsyncDeltaWriter::~AsyncDeltaWriter()
    @          0xbeee5b8  starrocks::LakeTabletsChannel::~LakeTabletsChannel()
    @          0x8103d3a  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
    @          0xbdea345  starrocks::LoadChannel::_add_chunk(starrocks::Chunk*, starrocks::MonotonicStopWatch const*, starrocks::PTabletWriterAddChunkRequest const&, starrocks::PTabletWriterAddBatchResult*)
    @          0xbdeb545  starrocks::LoadChannel::add_chunks(starrocks::PTabletWriterAddChunksRequest const&, starrocks::PTabletWriterAddBatchResult*)
    @          0xbddf054  starrocks::LoadChannelMgr::add_chunks(starrocks::PTabletWriterAddChunksRequest const&, starrocks::PTabletWriterAddBatchResult*)
    @          0xbfafb41  starrocks::BackendInternalServiceImpl<starrocks::PInternalService>::tablet_writer_add_chunks(google::protobuf::RpcController*, starrocks::PTabletWriterAddChunksRequest const*, starrocks::PTabletWriterAddBatchResult*, google::protobuf::Closure*)
    @         0x1038f253  brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*)
    @         0x102b493b  brpc::ProcessInputMessage(void*)
    @         0x102b5d84  brpc::InputMessenger::OnNewMessages(brpc::Socket*)
    @         0x102f26a2  brpc::Socket::ProcessEvent(void*)
    @         0x10269837  bthread::TaskGroup::task_runner(long)
    @         0x10252821  bthread_make_fcontext

I20260413 07:20:37.578325 139633710034112 daemon.cpp:344]  version 3.5.14-23a56ec

I then ran the SQL several more times:
EXPLAIN ANALYZE INSERT INTO adt_ip_country_source (__dt,ip,country,source) SELECT DATE_ADD('2026-02-01 00:00:00', INTERVAL d hour),'','','' FROM table(generate_series(0, 23)) AS g(d);

The crash did not happen again. It just hangs, and no new partitions can be created.

Do you have the stack trace from the crash?

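I didn't capture a stack trace at the time.
I haven't been able to reproduce the crash since. My feeling is that the crash may have been triggered by an internal "static" statement.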

I see 3.5.15 has been released. I'll try the new version shortly to see whether the problem still reproduces.

This matches what I suspected. https://github.com/StarRocks/starrocks/pull/70778

Let's test and verify again once 3.5.16 is released.

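I just tested 3.5.15 and it has the same problem.
Inserting a few rows on their own does not trigger it.
With INSERT or EXPLAIN ANALYZE statements, it triggers easily when the statement causes multiple partitions to be created.

Thank you. I'll wait for 3.5.16 and test again.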

In the current version, you can set the CN parameter enable_load_spill = false to turn off load spill, which avoids the LoadSpillBlockManager code path. That should prevent the hang.

I just tested 3.5.16 and the problem is fixed. :slight_smile:
