Routine load stops consuming Kafka data

【Details】On a production StarRocks cluster, every few days a different handful of routine load jobs consume so slowly that their offsets fall outside the Kafka data retention window, and the routine loads eventually abort with the following error:
ReasonOfStateChanged: ErrorReason{code=errCode = 104, msg='be 10004 abort task with reason: fetch failed due to requested offset not available on the broker: Broker: Offset out of range'}
【Shared-data (storage-compute separation)】No
【StarRocks version】2.3.14
【Cluster size】5 FE nodes (16 cores / 64 GB each), 10 BE nodes (16 cores / 64 GB each)
【Routine load layout】The cluster runs 26 routine load jobs in total, writing into 3 tables, and all 26 jobs consume the same Kafka topic.
Specifically:
a) 24 routine load jobs, split one-to-one across the topic's 24 partitions, loading into a DUPLICATE KEY table
b) 1 routine load job consuming the same topic as a), loading into an AGGREGATE KEY table
c) 1 more routine load job consuming the same topic as a), loading into a DUPLICATE KEY table

The table schemas for a), b), and c) are all different.

How should these routine loads be tuned so that consumption does not lag? A lag-checking sketch follows below.
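For reference, per-job state and lag can be inspected with SHOW ROUTINE LOAD; the ReasonOfStateChanged quoted above comes from its output, and the Progress field shows the consumed offset per Kafka partition. A sketch, using a job name taken from the logs later in this thread:

SHOW ROUTINE LOAD FOR beidou_rslog_database.load_34300_t_monitor_logs_20230922\G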

@jingdan @trueeyu @dongquan could you take a look? Thanks.

For consumption lag, pick one task and search the leader FE log and the BE logs for the related context.
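A minimal sketch of that search, assuming the default log file names (fe.log on the leader FE, be.INFO on each BE); <task_id> is a placeholder for the id of one lagging task:

grep 'routine load' fe.log   # on the leader FE: task state changes and abort reasons
grep '<task_id>' be.INFO     # on the BE that ran the task: the execution context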

Filtering the leader FE's fe.log for routine load, there are many log lines like the one below. What causes this? Also, how do I search the BE logs for the context?
2024-08-15 16:50:12,905 WARN (thrift-server-pool-58702|2541153) [RoutineLoadJob.executeTaskOnTxnStatusChanged():1013] routine load task [job name load_34300_t_monitor_logs_20230922, task id 759017fb-642d-4e6f-971c-e00c4f9d29f9] aborted because of Cancelled, remove old task and generate new one

Search by txn id; every load has a unique txn id.

Filtering a BE node's be.INFO by one of the task ids gives the logs below. What caused the Cancelled status?

I0815 16:50:09.883594 669524 routine_load_task_executor.cpp:221] submit a new routine load task: id=759017fb642d4e6f-971ce00c4f9d29f9, job_id=54095, txn_id: 78176435, label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, db=default_cluster:beidou_rslog_database, current tasks num: 4
I0815 16:50:09.883602 475401 routine_load_task_executor.cpp:238] begin to execute routine load task: id=759017fb642d4e6f-971ce00c4f9d29f9, job_id=54095, txn_id: 78176435, label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, db=default_cluster:beidou_rslog_database
I0815 16:50:09.885869 475401 stream_load_executor.cpp:57] begin to execute job. label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, txn_id: 78176435, query_id=759017fb-642d-4e6f-971c-e00c4f9d29f9
I0815 16:50:09.885915 475401 plan_fragment_executor.cpp:68] Prepare(): query_id=759017fb-642d-4e6f-971c-e00c4f9d29f9 fragment_instance_id=759017fb-642d-4e6f-971c-e00c4f9d29fa backend_num=0
I0815 16:50:09.887172 475401 data_consumer_group.cpp:101] start consumer group: 9f425d76b83ec27e-a662455b76f60097. max time(ms): 3000, batch size: 524288000. id=759017fb642d4e6f-971ce00c4f9d29f9, job_id=54095, txn_id: 78176435, label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, db=default_cluster:beidou_rslog_database
W0815 16:50:12.889364 475343 stream_load_executor.cpp:89] fragment execute failed, query_id=759017fb642d4e6f-971ce00c4f9d29f9, err_msg=Cancelled, id=759017fb642d4e6f-971ce00c4f9d29f9, job_id=54095, txn_id: 78176435, label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, db=default_cluster:beidou_rslog_database
I0815 16:50:12.912119 475401 routine_load_task_executor.cpp:207] finished routine load task id=759017fb642d4e6f-971ce00c4f9d29f9, job_id=54095, txn_id: 78176435, label=load_34300_t_monitor_logs_20230922-54095-759017fb-642d-4e6f-971c-e00c4f9d29f9-78176435, db=default_cluster:beidou_rslog_database, status: Cancelled, current tasks num: 2

It may be that the batch wasn't fully consumed within the 3s window. Search the leader FE for txn 78176435:

grep 78176435 fe.log

2024-08-15 16:50:09,874 INFO (pool-17-thread-2|660) [DatabaseTransactionMgr.beginTransaction():301] begin transaction: txn_id: 78176435 with label 759017fb-642d-4e6f-971c-e00c4f9d29f9 from coordinator FE: 10.176.132.143, listner id: 54095
2024-08-15 16:50:12,897 INFO (thrift-server-pool-58702|2541153) [FrontendServiceImpl.loadTxnRollback():1046] receive txn rollback request. db: default_cluster:beidou_rslog_database, tbl: t_monitor_logs_20230922, txn_id: 78176435, reason: Cancelled, backend: 10.176.78.22

Is there an FE or BE parameter that controls the max time(ms): 3000?

The FE config routine_load_task_consume_second; it defaults to 3.
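Assuming this config is runtime-mutable in your version (otherwise set it in fe.conf and restart the FEs), a sketch of raising it:

ADMIN SET FRONTEND CONFIG ("routine_load_task_consume_second" = "15");

Note that ADMIN SET FRONTEND CONFIG only changes the FE you are connected to, so run it on each FE, or at least on the leader.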

OK, I'll change that FE parameter and try again.
Also, can we confirm from the logs above that a batch taking longer than 3s to consume is what caused the routine load to be aborted because of Cancelled?

The BE logs above are from that node (the backend in the rollback log), right?

From this, it does look like that's the problem.


Yes.

OK, I'll adjust the FE parameter routine_load_task_consume_second and try it. Thanks.

@jingdan After raising the FE parameter routine_load_task_consume_second to 15, the leader FE's fe.log now shows a large number of timeout by txn manager messages. What causes this, and how can it be fixed?

grep 78442349 fe.log

2024-08-16 13:49:50,028 INFO (pool-17-thread-8|666) [DatabaseTransactionMgr.beginTransaction():301] begin transaction: txn_id: 78442349 with label 3fd6cf18-1505-435f-bc91-88aab6a8c0bc from coordinator FE: 10.176.132.143, listner id: 67422
2024-08-16 13:50:05,032 INFO (txnTimeoutChecker|83) [DatabaseTransactionMgr.abortTimeoutTxns():1525] transaction [78442349] is timeout, abort it by transaction manager
2024-08-16 13:50:05,405 INFO (thrift-server-pool-58706|2541157) [FrontendServiceImpl.loadTxnCommit():936] receive txn commit request. db: default_cluster:beidou_rslog_database, tbl: t_monitor_logs_date, txn_id: 78442349, backend: 10.176.137.14
2024-08-16 13:50:05,405 WARN (thrift-server-pool-58706|2541157) [FrontendServiceImpl.loadTxnCommit():958] failed to commit txn_id: 78442349: timeout by txn manager
2024-08-16 13:50:05,405 INFO (thrift-server-pool-58206|2528159) [FrontendServiceImpl.loadTxnRollback():1046] receive txn rollback request. db: default_cluster:beidou_rslog_database, tbl: t_monitor_logs_date, txn_id: 7844234, reason: timeout by txn manager, backend: 10.176.137.14
2024-08-16 13:50:05,406 WARN (thrift-server-pool-58206|2528159) [FrontendServiceImpl.loadTxnRollback():1064] failed to rollback txn 78442349: transaction not found

grep 3fd6cf18-1505-435f-bc91-88aab6a8c0bc be.INFO

I0816 13:49:50.023895 128493 routine_load_task_executor.cpp:221] submit a new routine load task: id=3fd6cf181505435f-bc9188aab6a8c0bc, job_id=67422, txn_id: 78442349, label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, db=default_cluster:beidou_rslog_database, current tasks num: 4
I0816 13:49:50.023911 620331 routine_load_task_executor.cpp:238] begin to execute routine load task: id=3fd6cf181505435f-bc9188aab6a8c0bc, job_id=67422, txn_id: 78442349, label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, db=default_cluster:beidou_rslog_database
I0816 13:49:50.024528 620331 stream_load_executor.cpp:57] begin to execute job. label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, txn_id: 78442349, query_id=3fd6cf18-1505-435f-bc91-88aab6a8c0bc
I0816 13:49:50.024559 620331 plan_fragment_executor.cpp:68] Prepare(): query_id=3fd6cf18-1505-435f-bc91-88aab6a8c0bc fragment_instance_id=3fd6cf18-1505-435f-bc91-88aab6a8c0bd backend_num=0
I0816 13:49:50.025877 620331 data_consumer_group.cpp:101] start consumer group: 1646ff65fd091cdb-2d7ad5d4ef6c5893. max time(ms): 15000, batch size: 524288000. id=3fd6cf181505435f-bc9188aab6a8c0bc, job_id=67422, txn_id: 78442349, label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, db=default_cluster:beidou_rslog_database
I0816 13:50:05.396423 620234 tablet_sink.cpp:1071] Olap table sink statistics. load_id: 3fd6cf18-1505-435f-bc91-88aab6a8c0bc, txn_id: 78442349, add chunk time(ms)/wait lock time(ms)/num: {10729:(354)(0)(2)} {10002:(13)(0)(2)} {10584:(11)(0)(2)} {10831:(35)(0)(3)} {10787:(23)(0)(3)} {10670:(16)(0)(2)} {10004:(10)(0)(2)} {62765:(13)(0)(2)} {103038:(25)(0)(3)} {10003:(11)(0)(2)}
W0816 13:50:05.398897 620331 stream_load_executor.cpp:201] commit transaction failed, errmsg=timeout by txn managerid=3fd6cf181505435f-bc9188aab6a8c0bc, job_id=67422, txn_id: 78442349, label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, db=default_cluster:beidou_rslog_database
I0816 13:50:05.399330 620331 routine_load_task_executor.cpp:207] finished routine load task id=3fd6cf181505435f-bc9188aab6a8c0bc, job_id=67422, txn_id: 78442349, label=load_34300_t_monitor_logs_date-67422-3fd6cf18-1505-435f-bc91-88aab6a8c0bc-78442349, db=default_cluster:beidou_rslog_database, status: timeout by txn manager, current tasks num: 2

The routine_load_task_timeout_second parameter must be larger than the consume parameter (routine_load_task_consume_second). In your logs the transaction was aborted exactly 15s after it began, so with the consume window also at 15s there is no time left to plan and commit before the txn manager times the transaction out.
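A sketch of setting the two together; the values are illustrative, the point being that the timeout must leave headroom beyond the consume window for planning and committing the transaction:

ADMIN SET FRONTEND CONFIG ("routine_load_task_consume_second" = "15");
ADMIN SET FRONTEND CONFIG ("routine_load_task_timeout_second" = "60");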

Got it, I'll adjust routine_load_task_timeout_second as well.

@jingdan After adjusting routine_load_task_timeout_second and routine_load_task_consume_second, the routine load lag gradually came down. However, after a BE node restarted abnormally, that BE's load_rows dropped to essentially 0 and the routine loads started lagging again. How can the BE's load capability be restored in this situation?