Routine Load error: Reached timeout

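Version 3.2.2, shared-data (storage-compute separated) deployment. Data is synced from Kafka via Routine Load, and the servers run on AWS.
Synchronization is broken: data does not load normally. Sometimes, after I manually resume the routine load, some data is synced and then nothing more arrives, yet the job state still shows RUNNING.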

Routine Load error output:

                  Id: 977383
                Name: risk_ip_site_agg
          CreateTime: 2024-02-06 19:26:05
           PauseTime: NULL
             EndTime: NULL
              DbName: prod_risk
           TableName: risk_ip_site_agg
               State: RUNNING
      DataSourceType: KAFKA
      CurrentTaskNum: 3
       JobProperties: {"partitions":"*","rowDelimiter":"\t","partial_update":"false","columnToColumnExpr":"ip_addr,site_code,ori_user,ori_fail_user,ip_part=abs(murmur_hash3_32(ip_addr)) % 64,users=to_bitmap(ori_user),fail_req_users=to_bitmap(ori_fail_user)","maxBatchIntervalS":"10","partial_update_mode":"null","whereExpr":"*","timezone":"Asia/Hong_Kong","format":"csv","columnSeparator":"','","log_rejected_record_num":"0","taskTimeoutSecond":"60","json_root":"","maxFilterRatio":"1.0","strict_mode":"false","jsonpaths":"","taskConsumeSecond":"15","desireTaskConcurrentNum":"3","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"3","maxBatchRows":"200000"}
DataSourceProperties: {"topic":"sync_risk_ip_site_agg","currentKafkaPartitions":"0,1,2,3,4,5,6,7,8,9","brokerList":"10.193.6.72:9092,10.193.7.8:9092,10.193.6.68:9092"}
    CustomProperties: {"security.protocol":"SASL_PLAINTEXT","sasl.username":"betuser","sasl.mechanism":"PLAIN","kafka_default_offsets":"OFFSET_END","group.id":"risk_ip_site_agg_c4b15041-9111-473d-8209-b2950a65cbe2","sasl.password":"******"}
           Statistic: {"receivedBytes":37304362,"errorRows":0,"committedTaskNum":4,"loadedRows":1100257,"loadRowsRate":7000,"abortedTaskNum":1158,"totalRows":1100257,"unselectedRows":0,"receivedBytesRate":249000,"taskExecuteTimeMs":149379}
            Progress: {"0":"300114","1":"203537","2":"389332","3":"385506","4":"203836","5":"389097","6":"384457","7":"202339","8":"388018","9":"289559"}
ReasonOfStateChanged: 
        ErrorLogUrls: 
         TrackingSQL: 
            OtherMsg: 2024-02-07 09:18:43: [E1008]Reached timeout=30000ms @10.193.30.207:8060
                  Id: 977328
                Name: risk_ip_agg
          CreateTime: 2024-02-06 19:25:28
           PauseTime: NULL
             EndTime: NULL
              DbName: prod_risk
           TableName: risk_ip_agg
               State: RUNNING
      DataSourceType: KAFKA
      CurrentTaskNum: 3
       JobProperties: {"partitions":"*","rowDelimiter":"\t","partial_update":"false","columnToColumnExpr":"ip_addr,ori_site,ori_user,ori_fail_user,ip_part=abs(murmur_hash3_32(ip_addr)) % 64,sites=to_bitmap(ori_site),users=to_bitmap(ori_user),fail_req_users=to_bitmap(ori_fail_user)","maxBatchIntervalS":"10","partial_update_mode":"null","whereExpr":"*","timezone":"Asia/Hong_Kong","format":"csv","columnSeparator":"','","log_rejected_record_num":"0","taskTimeoutSecond":"60","json_root":"","maxFilterRatio":"1.0","strict_mode":"false","jsonpaths":"","taskConsumeSecond":"15","desireTaskConcurrentNum":"3","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"3","maxBatchRows":"200000"}
DataSourceProperties: {"topic":"sync_risk_ip_site_agg","currentKafkaPartitions":"0,1,2,3,4,5,6,7,8,9","brokerList":"10.193.6.72:9092,10.193.7.8:9092,10.193.6.68:9092"}
    CustomProperties: {"security.protocol":"SASL_PLAINTEXT","sasl.username":"betuser","sasl.mechanism":"PLAIN","kafka_default_offsets":"OFFSET_END","group.id":"risk_ip_agg_10ec69cb-2ec8-4e36-8485-c30f018eebbf","sasl.password":"******"}
           Statistic: {"receivedBytes":66865024,"errorRows":0,"committedTaskNum":6,"loadedRows":1972209,"loadRowsRate":6000,"abortedTaskNum":952,"totalRows":1972209,"unselectedRows":0,"receivedBytesRate":218000,"taskExecuteTimeMs":305696}
            Progress: {"0":"303934","1":"384602","2":"497530","3":"384480","4":"383214","5":"496090","6":"382785","7":"381417","8":"495188","9":"281821"}
ReasonOfStateChanged: 
        ErrorLogUrls: 
         TrackingSQL: 
            OtherMsg: 2024-02-07 09:33:18: Cancelled because of runtime state is cancelled
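
To put a number on how often tasks fail, here is a minimal sketch that reads the two `Statistic` values shown above and computes the share of aborted tasks. The helper name `abort_ratio` is made up for illustration, and treating `abortedTaskNum / (abortedTaskNum + committedTaskNum)` as the failure rate is my own assumption about how to read these counters.

```python
import json

# Statistic fields copied verbatim from the SHOW ROUTINE LOAD output above.
STATISTICS = {
    "risk_ip_site_agg": '{"receivedBytes":37304362,"errorRows":0,"committedTaskNum":4,'
                        '"loadedRows":1100257,"loadRowsRate":7000,"abortedTaskNum":1158,'
                        '"totalRows":1100257,"unselectedRows":0,"receivedBytesRate":249000,'
                        '"taskExecuteTimeMs":149379}',
    "risk_ip_agg": '{"receivedBytes":66865024,"errorRows":0,"committedTaskNum":6,'
                   '"loadedRows":1972209,"loadRowsRate":6000,"abortedTaskNum":952,'
                   '"totalRows":1972209,"unselectedRows":0,"receivedBytesRate":218000,'
                   '"taskExecuteTimeMs":305696}',
}


def abort_ratio(statistic_json: str) -> float:
    """Hypothetical helper: fraction of finished tasks that were aborted."""
    stat = json.loads(statistic_json)
    aborted = stat["abortedTaskNum"]
    committed = stat["committedTaskNum"]
    finished = aborted + committed
    # Avoid dividing by zero for a job that has not finished any task yet.
    return aborted / finished if finished else 0.0


for job_name, raw in STATISTICS.items():
    print(f"{job_name}: {abort_ratio(raw):.2%} of tasks aborted")
```

With these numbers, more than 99% of the tasks in both jobs were aborted, which matches the symptom that the job stays RUNNING while almost no data is committed.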
10.193.31.143:prod_risk 12:51:32>SHOW ROUTINE LOAD TASK FROM prod_risk where JobName = "risk_ip_agg" \G;
*************************** 1. row ***************************
              TaskId: a96f1d43-9977-4e50-a9f4-c366ee373164
               TxnId: 6020
           TxnStatus: UNKNOWN
               JobId: 977328
          CreateTime: 2024-02-07 12:50:44
   LastScheduledTime: 2024-02-07 12:50:54
    ExecuteStartTime: 2024-02-07 12:50:54
             Timeout: 60
                BeId: 10117
DataSourceProperties: Progress:{"2":588520,"5":586738,"8":585579},LatestOffset:{"8":4955876,"5":4958054,"2":4954003}
             Message: task submitted to execute
*************************** 2. row ***************************
              TaskId: 168713e6-d88c-41d4-9ee4-26e8ec180803
               TxnId: 6024
           TxnStatus: UNKNOWN
               JobId: 977328
          CreateTime: 2024-02-07 12:50:56
   LastScheduledTime: 2024-02-07 12:51:06
    ExecuteStartTime: 2024-02-07 12:51:07
             Timeout: 60
                BeId: 10118
DataSourceProperties: Progress:{"0":351569,"3":475097,"6":473316,"9":325120},LatestOffset:{"0":4957627,"9":4955886,"3":4958223,"6":4954565}
             Message: task submitted to execute
*************************** 3. row ***************************
              TaskId: 492963d1-324c-430f-9c54-46e6e72093ea
               TxnId: 6023
           TxnStatus: UNKNOWN
               JobId: 977328
          CreateTime: 2024-02-07 12:50:56
   LastScheduledTime: 2024-02-07 12:51:06
    ExecuteStartTime: 2024-02-07 12:51:07
             Timeout: 60
                BeId: 10116
DataSourceProperties: Progress:{"1":609803,"4":607813,"7":605763},LatestOffset:{"4":4960785,"1":4957671,"7":4954137}
             Message: task submitted to execute
3 rows in set (0.00 sec)
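
The task output above also shows how far consumption is behind. Below is a rough sketch that subtracts `Progress` from `LatestOffset` for each Kafka partition, using the values from the three task rows. The function name `partition_lag` is hypothetical, and I am assuming the difference approximates the number of unconsumed messages (depending on whether `Progress` means the last consumed or the next offset, it may be off by one).

```python
# Progress and LatestOffset copied from the three SHOW ROUTINE LOAD TASK rows above.
TASKS = [
    {
        "progress": {"2": 588520, "5": 586738, "8": 585579},
        "latest": {"8": 4955876, "5": 4958054, "2": 4954003},
    },
    {
        "progress": {"0": 351569, "3": 475097, "6": 473316, "9": 325120},
        "latest": {"0": 4957627, "9": 4955886, "3": 4958223, "6": 4954565},
    },
    {
        "progress": {"1": 609803, "4": 607813, "7": 605763},
        "latest": {"4": 4960785, "1": 4957671, "7": 4954137},
    },
]


def partition_lag(tasks):
    """Hypothetical helper: approximate backlog per Kafka partition."""
    lag = {}
    for task in tasks:
        for partition, consumed in task["progress"].items():
            # LatestOffset minus Progress, treated as the unconsumed backlog.
            lag[partition] = task["latest"][partition] - consumed
    return lag


LAG = partition_lag(TASKS)
for partition in sorted(LAG, key=int):
    print(f"partition {partition}: lag ~{LAG[partition]}")
print(f"total lag ~{sum(LAG.values())}")
```

Every partition is more than 4 million offsets behind, so the job is clearly not keeping up even though its state is RUNNING.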
Errors on the CN:

/build/starrocks/be/src/exec/tablet_sink_index_channel.cpp:853 _wait_all_prev_request()
W0206 19:45:50.524520 37534 fragment_mgr.cpp:327] Retrying ReportExecStatus: No more data to read.
W0206 19:45:50.526177 37534 fragment_mgr.cpp:201] Fail to open fragment 582d6bd6-2914-4099-b5a0-434b5d7a8d8f: Internal error: [E1008]Reached timeout=30000ms @10.193.30.207:8060
/build/starrocks/be/src/exec/tablet_sink_index_channel.cpp:757 _wait_request(closure)
/build/starrocks/be/src/exec/tablet_sink_index_channel.cpp:853 _wait_all_prev_request()
W0206 19:45:50.526273 37534 stream_load_executor.cpp:111] fragment execute failed, query_id=582d6bd629144099-b5a0434b5d7a8d8e, err_msg=[E1008]Reached timeout=30000ms @10.193.30.207:8060, id=582d6bd629144099-b5a0434b5d7a8d8e, job_id=977328, txn_id: 361, label=risk_ip_agg-977328-582d6bd6-2914-4099-b5a0-434b5d7a8d8e, db=prod_risk
W0206 19:45:50.526319 73790 routine_load_task_executor.cpp:505] consume failed id=582d6bd629144099-b5a0434b5d7a8d8e, job_id=977328, txn_id: 361, label=risk_ip_agg-977328-582d6bd6-2914-4099-b5a0-434b5d7a8d8e, db=prod_risk
W0206 19:45:51.074395 37987 agent_task.cpp:224] create table failed. status: Invalid argument: starlet err Invalid sys.root configuration provided!
/build/starrocks/be/src/storage/protobuf_file.cpp:115 value_or_err_L115
/build/starrocks/be/src/storage/lake/tablet_manager.cpp:194 file.save(*metadata), signature: 978962

Roughly how many routine load jobs does this cluster have? Also, could you confirm the current value of the parameter max_routine_load_task_num_per_be?

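The cluster has 3 routine load jobs, and max_routine_load_task_num_per_be is at its default value of 16. Could you leave some contact details? This system is used in production, so the issue is fairly urgent.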

ADMIN SHOW FRONTEND CONFIG like "max_routine_load_task_num_per_be";
+----------------------------------+------------+-------+------+-----------+---------+
| Key                              | AliasNames | Value | Type | IsMutable | Comment |
+----------------------------------+------------+-------+------+-----------+---------+
| max_routine_load_task_num_per_be | []         | 16    | int  | true      |         |
+----------------------------------+------------+-------+------+-----------+---------+
1 row in set (0.00 sec)