Routine load tasks cause CN nodes to restart frequently, one after another

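We run 500+ routine load tasks. Starting yesterday afternoon, the CN nodes began restarting one after another, and very frequently. Based on our previous experience, we stopped all routine load tasks, after which the CN nodes returned to normal and no longer restarted. (PS: last time we traced the problem to special characters in a STRING column of one routine load task's table; this time it is most likely the same cause.) We later re-enabled all routine load tasks in production, and the problem did not recur. The incident occurred at 16:38 on 8-26: the CN hung at that point, the k8s health check subsequently failed, and the service was restarted automatically.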
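
With 500+ jobs, stopping and re-enabling them by hand is tedious. Below is a minimal sketch that generates the statements in bulk. It assumes `PAUSE ROUTINE LOAD FOR` / `RESUME ROUTINE LOAD FOR` are used (so the jobs can be resumed later) rather than whatever method was actually used here; the database and job names are hypothetical placeholders.

```python
def routine_load_statements(jobs, action="PAUSE"):
    """Build one PAUSE/RESUME ROUTINE LOAD statement per (db, job) pair.

    `jobs` is an iterable of (database_name, job_name) tuples, e.g. collected
    from the output of SHOW ROUTINE LOAD. Names here are illustrative only.
    """
    if action not in ("PAUSE", "RESUME"):
        raise ValueError("action must be 'PAUSE' or 'RESUME'")
    return [f"{action} ROUTINE LOAD FOR {db}.{job};" for db, job in jobs]


if __name__ == "__main__":
    # Hypothetical job list; replace with the real database/job names.
    demo_jobs = [("demo_db", "job_a"), ("demo_db", "job_b")]
    for stmt in routine_load_statements(demo_jobs, "PAUSE"):
        print(stmt)
```

Resuming the jobs in small batches instead of all at once may help narrow down whether one specific job triggers the restarts.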

E0826 16:38:35.767621 826 scan_operator.cpp:436] scan fragment 924dbbe2-6386-11ef-be9f-92596f1d8048 driver 0 Scan tasks error: Internal error: starlet err grpc.GetShard(shardId=5920885) error: Deadline Exceeded
be/src/storage/rowset/segment.cpp:235 value_or_err_L235
be/src/storage/lake/tablet_manager.cpp:621 segment->open(footer_size_hint, nullptr, lake_io_opts)
be/src/storage/lake/rowset.cpp:121 load_segments(&segments, options.lake_io_opts.fill_data_cache, options.lake_io_opts.buffer_size)
be/src/storage/lake/tablet_reader.cpp:281 value_or_err_L281
be/src/storage/lake/tablet_reader.cpp:379 get_segment_iterators(params, &seg_iters)
be/src/connector/lake_connector.cpp:350 _reader->open(_params)
be/src/connector/lake_connector.cpp:91 init_tablet_reader(_runtime_state)
be/src/exec/pipeline/scan/connector_scan_operator.cpp:766 _data_source->open(state)
be/src/exec/pipeline/scan/connector_scan_operator.cpp:793 _open_data_source(state, &mem_alloc_failed)
E0826 16:38:35.796646 646 scan_operator.cpp:436] scan fragment 92047e26-6386-11ef-b050-ba2974846dc3 driver 1 Scan tasks error: Internal error: starlet err grpc.GetShard(shardId=5756152) error: Deadline Exceeded
be/src/storage/rowset/segment_iterator.cpp:576 value_or_err_L576
be/src/storage/rowset/segment_iterator.cpp:636 _init_column_iterator_by_cid(cid, f->uid(), check_dict_enc)
be/src/storage/rowset/segment_iterator.cpp:442 _init_column_iterators(_schema)
be/src/storage/rowset/segment_iterator.cpp:1045 _init()
be/src/storage/lake/tablet_reader.cpp:192 _collect_iter->get_next(chunk)
be/src/connector/lake_connector.cpp:117 _prj_iter->get_next(chunk_ptr)
W0826 16:38:35.799453 580 pipeline_driver.cpp:311] pull_chunk returns not ok status Internal error: starlet err grpc.GetShard(shardId=5756152) error: Deadline Exceeded
be/src/storage/rowset/segment_iterator.cpp:576 value_or_err_L576
be/src/storage/rowset/segment_iterator.cpp:636 _init_column_iterator_by_cid(cid, f->uid(), check_dict_enc)
be/src/storage/rowset/segment_iterator.cpp:442 _init_column_iterators(_schema)
be/src/storage/rowset/segment_iterator.cpp:1045 _init()
be/src/storage/lake/tablet_reader.cpp:192 _collect_iter->get_next(chunk)
be/src/connector/lake_connector.cpp:117 _prj_iter->get_next(chunk_ptr)
be/src/exec/pipeline/scan/scan_operator.cpp:250 _get_scan_status()
W0826 16:38:35.799477 580 pipeline_driver_executor.cpp:168] [Driver] Process error, query_id=92047e26-6386-11ef-b050-ba2974846dbd, instance_id=92047e26-6386-11ef-b050-ba2974846dc3, status=Internal error: starlet err grpc.GetShard(shardId=5756152) error: Deadline Exceeded
be/src/storage/rowset/segment_iterator.cpp:576 value_or_err_L576
be/src/storage/rowset/segment_iterator.cpp:636 _init_column_iterator_by_cid(cid, f->uid(), check_dict_enc)
be/src/storage/rowset/segment_iterator.cpp:442 _init_column_iterators(_schema)
be/src/storage/rowset/segment_iterator.cpp:1045 _init()
be/src/storage/lake/tablet_reader.cpp:192 _collect_iter->get_next(chunk)
be/src/connector/lake_connector.cpp:117 _prj_iter->get_next(chunk_ptr)
be/src/exec/pipeline/scan/scan_operator.cpp:250 _get_scan_status()
I0826 16:38:35.803100 552 pipeline_driver_executor.cpp:354] [Driver] Succeed to report exec state: fragment_instance_id=92047e26-6386-11ef-b050-ba2974846dc0, is_done=1
I0826 16:38:35.803112 228774 pipeline_driver_executor.cpp:354] [Driver] Succeed to report exec state: fragment_instance_id=92047e26-6386-11ef-b050-ba2974846dc3, is_done=1
I0826 16:44:21.935317 27 daemon.cpp:291] version 3.3.0-19a3f66
BuildType: RELEASE
Build distributor id: ubuntu
Built on 2024-06-21 11:04:45 by StarRocks@localhost (Ubuntu 22.04.3 LTS)
I0826 16:44:21.972420 27 mem_info.cpp:153] Init mem info by container's cgroup config, physical_mem=128849018880
I0826 16:44:21.972446 27 mem_info.cpp:104] Physical Memory: 120.00 GB
I0826 16:44:21.972456 27 daemon.cpp:297] Cpu Info:

【Shared-data (storage-compute separation)】Yes
【StarRocks version】3.3.0
【Cluster size】3 FE, 3 CN
【Machine specs】FE: 16 cores, 32 GB; CN: 32 cores, 128 GB

【Attachments】
http://sivaoy2hs.hn-bkt.clouddn.com/cn.INFO.log.20240826-154937

http://sivaoy2hs.hn-bkt.clouddn.com/cn.WARNING

http://sivaoy2hs.hn-bkt.clouddn.com/cn.out