Routine Load job changes from RUNNING to PAUSED every day around 1 a.m. and around 9-10 a.m.

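To help locate your problem faster, please provide the following information. Thanks.
【Details】Every day around 1 a.m. and around 9-10 a.m., the Routine Load job state changes from RUNNING to PAUSED.
【Background】At initial deployment the metadata and data directories were not placed on the data disk. We recently migrated the FE and BE metadata and data directories. The migration was done on May 26, and the problem started on the 28th: every day around 1 a.m. and around 9-10 a.m. the job state changes from RUNNING to PAUSED.
【Business impact】We have received complaints.
【Shared-data (storage-compute separation)?】No: shared-nothing, mixed deployment
【StarRocks version】3.4.1-2f78e09
【Cluster size】3 FE (1 leader, 2 followers) + 3 BE (FE and BE co-located)
【Machine info】vCPU/memory/NIC, e.g. 6C/64G/10 GbE
【Contact】Community group 26, 张宇辉, email 654561513@qq.com. Thanks.
【Attachments】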
ReasonOfStateChanged:
Error 1: ErrorReason{errCode = 104, msg='be 10003 abort task with reason: kafka consume failed, err: fetch failed due to requested offset not available on the broker: Broker: Offset out of range (broker 3)'}

Error 2: ErrorReason{errCode = 104, msg='FE aborts the task with reason: failed to check task ready to execute, err: Consume offset: 885003 is greater than the latest offset: 883238 in kafka partition: 1. You can modify 'kafka_offsets' property through ALTER ROUTINE LOAD and RESUME the job'}

LatestSourcePosition: {"0":"889833","1":"883238","2":"890369"}
Progress: {"0":"889832","1":"885002","2":"890369"}
Kafka offsets:
[root@ bin]# ./kafka-consumer-groups.sh --bootstrap-server 10.120.7.102:9092 --describe --group group_starrock_20250529

Consumer group 'group_starrock_20250529' has no active members.

GROUP                    TOPIC                       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST  CLIENT-ID
group_starrock_20250529  equipment_signal_log_topic  2          890370          898518          8148  -            -     -
group_starrock_20250529  equipment_signal_log_topic  1          885003          893150          8147  -            -     -
group_starrock_20250529  equipment_signal_log_topic  0          889833          898013          8180  -            -     -

Kafka is healthy and Progress is {"0":"889832","1":"885002","2":"890369"}, so why is LatestSourcePosition wrong?
Checking the task assignment shows that BeId is -1, yet the job really does keep running and data is syncing normally.
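To make the mismatch explicit, here is a minimal Python sketch that compares the three sets of offsets quoted above. It assumes the next offset to consume is Progress + 1, which is only inferred from error 2 (Progress 885002, "Consume offset: 885003"). The function and variable names are made up for illustration, and the values were captured at different moments, so treat the result as a rough consistency check, not a diagnosis.

```python
# Values copied from the job status and the kafka-consumer-groups.sh output above.
progress = {0: 889832, 1: 885002, 2: 890369}                # Progress
latest_source_position = {0: 889833, 1: 883238, 2: 890369}  # LatestSourcePosition
kafka_log_end = {0: 898013, 1: 893150, 2: 898518}            # LOG-END-OFFSET (Kafka CLI)


def next_consume_offsets(progress_by_partition):
    """Assumption: next offset to consume = Progress + 1 (885002 -> 885003 in error 2)."""
    return {p: off + 1 for p, off in progress_by_partition.items()}


def offsets_ahead_of(next_offsets, end_offsets):
    """Return {partition: distance} for partitions whose next consume
    offset is strictly greater than the given end offset."""
    return {
        p: next_offsets[p] - end_offsets[p]
        for p in sorted(next_offsets)
        if next_offsets[p] > end_offsets[p]
    }


next_offsets = next_consume_offsets(progress)
ahead_of_fe_view = offsets_ahead_of(next_offsets, latest_source_position)
ahead_of_broker = offsets_ahead_of(next_offsets, kafka_log_end)

print("next consume offsets:", next_offsets)
print("ahead of LatestSourcePosition:", ahead_of_fe_view)
print("ahead of Kafka LOG-END-OFFSET:", ahead_of_broker)
```

With the numbers above, partition 1 is 1765 offsets past LatestSourcePosition (885003 vs 883238) and partition 2 is 1 past it, while no partition is past the LOG-END-OFFSET that the Kafka CLI reports. In other words, the offsets only look invalid against the LatestSourcePosition value, not against what the broker reports through the CLI.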



FE is healthy.

BE is healthy.

CREATE ROUTINE LOAD ods.rload__device_signal_log_topic ON ods_kfk__device_signal_log_i
COLUMNS(
    process, workshop, line_type, equipment_code, log_type, type, rfid, big_boat, boat,
    runid_txt, wafer_id_txt, disconnection_time, restore_time, recording_time, _up_time,
    new_type = CASE WHEN log_type = 1 THEN CASE WHEN length(restore_time) > 0 THEN 51 ELSE 50 END ELSE type END,
    ods_create_time = current_timestamp()
)
PROPERTIES(
    "max_batch_rows" = "200000",
    "desired_concurrent_number" = "3",
    "jsonpaths" = "[\"$.process\",\"$.workshop\",\"$.lineType\",\"$.equipment_code\",\"$.logType\",\"$.type\",\"$.rfid\",\"$.bigBoat\",\"$.boat\",\"$.runIdTxt\",\"$.waferIdTxt\",\"$.disconnectionTime\",\"$.restoreTime\",\"$.recordingTime\",\"$._up_time\"]",
    "format" = "json",
    "max_error_number" = "0",
    "max_batch_interval" = "10"
)
FROM KAFKA (
    "property.kafka_default_offsets" = "OFFSET_BEGINNING",
    "kafka_broker_list" = "xxxx:9092",
    "kafka_topic" = "equipment_signal_log_topic",
    "property.group.id" = "group_starrock_20250529"
);

The target table is a Duplicate Key table.

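Regarding the Routine Load job changing from RUNNING to PAUSED around 1 a.m. and 9-10 a.m. every day: are any other SQL operations running against StarRocks at those times?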

No. I checked the logs a couple of days ago, and it looks like a network problem.

An experienced user shared some parameters, and after setting them the problem seems to have gone away.
To apply them dynamically:
ADMIN SET FRONTEND CONFIG ("thrift_rpc_timeout_ms" = "60000");
UPDATE information_schema.be_configs SET value = 60000 WHERE name = 'thrift_rpc_timeout_ms';
UPDATE information_schema.be_configs SET value = 60000 WHERE name = 'txn_commit_rpc_timeout_ms';
To persist them in the configuration files:
FE: thrift_rpc_timeout_ms = 60000
BE: thrift_rpc_timeout_ms = 60000
BE: txn_commit_rpc_timeout_ms = 60000