Routine Load job switches from RUNNING to PAUSED every day around 1 a.m. and again around 9-10 a.m.

yuhui_zhang · 2025-05-30 18:27

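To help locate your problem faster, please provide the following information. Thank you.
【Details】The Routine Load job switches from RUNNING to PAUSED every day around 1 a.m. and again around 9-10 a.m.
【Background】When the cluster was first deployed, the metadata and data directories were not placed on the data disk. We recently migrated the FE and BE metadata and data directories. The migration was done on May 26, and since May 28 the job has switched from RUNNING to PAUSED every day around 1 a.m. and again around 9-10 a.m.
【Business impact】We have received complaints.
【Storage-compute separation?】No. Shared-nothing (integrated storage and compute), co-located deployment.
【StarRocks version】3.4.1-2f78e09
【Cluster size】3 FE (1 leader, 2 followers) + 3 BE (FE and BE co-located)
【Machine info】CPU vCores/memory/NIC, e.g. 6C/64G/10GbE
【Contact】Community group 26, 张宇辉, email 654561513@qq.com. Thank you.
【Attachments】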
ReasonOfStateChanged:
Error 1: ErrorReason{errCode = 104, msg='be 10003 abort task with reason: kafka consume failed, err: fetch failed due to requested offset not available on the broker: Broker: Offset out of range (broker 3)'}

Error 2: ErrorReason{errCode = 104, msg='FE aborts the task with reason: failed to check task ready to execute, err: Consume offset: 885003 is greater than the latest offset: 883238 in kafka partition: 1. You can modify 'kafka_offsets' property through ALTER ROUTINE LOAD and RESUME the job'}

LatestSourcePosition: {"0":"889833","1":"883238","2":"890369"}
Progress: {"0":"889832","1":"885002","2":"890369"}
Kafka offsets:
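The two fields above can be compared mechanically. The following is a minimal Python sketch, not StarRocks code, that takes the Progress and LatestSourcePosition values copied from this post and lists the partitions whose next offset to consume lies beyond the latest offset reported for the job. It assumes that the next offset is Progress + 1, which is inferred from error 2 (Progress 885002, consume offset 885003). The function name `partitions_ahead_of_source` is made up for illustration.

```python
import json


def partitions_ahead_of_source(progress_json: str, latest_json: str) -> dict:
    """Return {partition: gap} where the next offset to consume exceeds the latest offset.

    Assumption (inferred from error 2 in this post): next offset = Progress + 1.
    """
    progress = json.loads(progress_json)
    latest = json.loads(latest_json)
    ahead = {}
    for partition, consumed in progress.items():
        if partition not in latest:
            continue  # no latest offset reported for this partition
        gap = int(consumed) + 1 - int(latest[partition])
        if gap > 0:
            ahead[partition] = gap
    return ahead


# Values copied from this post.
progress = '{"0":"889832","1":"885002","2":"890369"}'
latest = '{"0":"889833","1":"883238","2":"890369"}'
result = partitions_ahead_of_source(progress, latest)
print(result)
```

With the values in this post, partition 1 comes out 1765 offsets ahead, which matches the numbers in error 2 (885003 versus 883238), and partition 2 comes out 1 offset ahead. Partition 0 is not ahead.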
[root@ bin]# ./kafka-consumer-groups.sh --bootstrap-server 10.120.7.102:9092 --describe --group group_starrock_20250529

Consumer group 'group_starrock_20250529' has no active members.

GROUP                    TOPIC                       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST  CLIENT-ID
group_starrock_20250529  equipment_signal_log_topic  2          890370          898518          8148  -            -     -
group_starrock_20250529  equipment_signal_log_topic  1          885003          893150          8147  -            -     -
group_starrock_20250529  equipment_signal_log_topic  0          889833          898013          8180  -            -     -
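To put the broker-side numbers next to what the job reports, here is a small Python sketch that parses the data rows of the `kafka-consumer-groups.sh --describe` output above and subtracts the job's LatestSourcePosition from the LOG-END-OFFSET of each partition. The helper name `log_end_offsets` is hypothetical, and the two snapshots in this post were not necessarily taken at the same moment, so the differences are only a rough indication.

```python
import json


def log_end_offsets(describe_rows: str) -> dict:
    """Map partition -> LOG-END-OFFSET from `kafka-consumer-groups.sh --describe` rows.

    Expects whitespace-separated columns in the order shown in this post:
    GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
    """
    offsets = {}
    for line in describe_rows.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and any line without numeric columns.
        if len(fields) < 6 or not fields[2].isdigit() or not fields[4].isdigit():
            continue
        offsets[fields[2]] = int(fields[4])
    return offsets


# Rows and LatestSourcePosition copied from this post.
rows = """
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
group_starrock_20250529 equipment_signal_log_topic 2 890370 898518 8148 - - -
group_starrock_20250529 equipment_signal_log_topic 1 885003 893150 8147 - - -
group_starrock_20250529 equipment_signal_log_topic 0 889833 898013 8180 - - -
"""
latest = json.loads('{"0":"889833","1":"883238","2":"890369"}')

broker = log_end_offsets(rows)
# How far the job's LatestSourcePosition trails the broker's log end, per partition.
behind = {p: broker[p] - int(latest[p]) for p in sorted(broker)}
print(behind)
```

For the values in this post, the differences are 8180 for partition 0, 9912 for partition 1, and 8149 for partition 2, so partition 1 trails the broker's log end by noticeably more than the other two.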

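Kafka looks normal and Progress is {"0":"889832","1":"885002","2":"890369"}, so why is LatestSourcePosition abnormal?
Looking at the task assignment, BeId is -1, yet the job does keep running and data is being synchronized normally.

FE is normal.

BE is normal.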

yuhui_zhang · 2025-05-30 18:10

CREATE ROUTINE LOAD ods.rload__device_signal_log_topic ON ods_kfk__device_signal_log_i
COLUMNS(process,workshop,line_type,equipment_code,log_type,type,rfid,big_boat,boat,runid_txt,wafer_id_txt,disconnection_time,restore_time,recording_time,_up_time,new_type=CASE WHEN log_type = 1 THEN CASE WHEN length(restore_time) > 0 THEN 51 ELSE 50 END ELSE type END,ods_create_time=current_timestamp())
PROPERTIES(
    "max_batch_rows" = "200000",
    "desired_concurrent_number" = "3",
    "jsonpaths" = "[\"$.process\",\"$.workshop\",\"$.lineType\",\"$.equipment_code\",\"$.logType\",\"$.type\",\"$.rfid\",\"$.bigBoat\",\"$.boat\",\"$.runIdTxt\",\"$.waferIdTxt\",\"$.disconnectionTime\",\"$.restoreTime\",\"$.recordingTime\",\"$._up_time\"]",
    "format" = "json",
    "max_error_number" = "0",
    "max_batch_interval" = "10"
)
FROM KAFKA (
    "property.kafka_default_offsets" = "OFFSET_BEGINNING",
    "kafka_broker_list" = "xxxx:9092",
    "kafka_topic" = "equipment_signal_log_topic",
    "property.group.id" = "group_starrock_20250529"
);

The target table is a Duplicate Key table (detail model).

andyhdic · 2025-06-03 07:37

Regarding the Routine Load job switching from RUNNING to PAUSED every day around 1 a.m. and around 9-10 a.m.: is any other SQL being run against SR?

yuhui_zhang · 2025-06-04 06:04

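No. I checked the logs a couple of days ago, and it looks like a network problem.

An experienced user shared some parameters, and after setting them the problem seems to be gone.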
To apply dynamically:
ADMIN SET FRONTEND CONFIG ("thrift_rpc_timeout_ms" = "60000");
UPDATE information_schema.be_configs SET value = 60000 WHERE name = 'thrift_rpc_timeout_ms';
UPDATE information_schema.be_configs SET value = 60000 WHERE name = 'txn_commit_rpc_timeout_ms';
To persist in the configuration files:
FE: thrift_rpc_timeout_ms = 60000
BE: thrift_rpc_timeout_ms = 60000
BE: txn_commit_rpc_timeout_ms = 60000