StarRocks Stream Load writes frequently fail with [E1008]Reached timeout=30000ms

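To help locate your problem faster, please provide the following information. Thank you.
【Details】Detailed description of the problem
Since the cluster migration completed on 2026-04-10, all 11 Flink CDC → StarRocks real-time ingestion jobs have been failing continuously with Stream Load timeouts. The error reproduces consistently and has not recovered so far.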

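  • Migration source: old cluster on 3.3.18, originally 1 FE / 6 BE (the BEs were unbalanced, with an obvious weakest-link effect). The old cluster had earlier been upgraded from 3.3.4 to 3.3.18.
  • Migration method: the official cross-cluster data migration tool (full + incremental sync)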

【Background】What has been done so far?

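  1. All FE and BE services are running normally, and the logs show no obvious service-level errors.
  2. CPU, memory, disk IO, and network IO usage on the cluster are not particularly high; see the attachments for details.
  3. I picked one failed transaction ID and checked every BE and FE log line related to it. The write appears to be stuck at the secondary replica step. It is unclear whether it is queuing there or whether the thread has died.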
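【Business impact】Real-time writes fail frequently, which affects data ingestion
【Shared-data (storage-compute separation)】No
【StarRocks version】e.g. 3.3.18
【Cluster size】e.g. 3 FE (1 follower + 2 observers) + 5 BE (FE and BE are not co-located)
【Machine info】vCPU/memory/NIC, e.g. 24C/128G/Gigabit
【Contact】Community group 17-Golden
【Attachments】
Monitoring graphs for BE CPU, memory, disk IO, and network IO: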

In addition, the cluster overview frequently shows BEs as dead, but SHOW BACKENDS reports all of them as normal:
image

BE and FE logs that mention one of the failed transaction IDs (2200384):

[root@fe1 ~]# cat /home/StarRocks/fe/log/fe.log | grep 2200384                                                                                                

2026-04-22 10:59:47.150+08:00 INFO (leaderCheckpointer|106) [LoadMgr.replayEndLoadJob():317] LOAD_JOB=5137727, operation={LoadJobEndOperation{id=5137727, load
ingStatus=EtlStatus{state=RUNNING, trackingUrl='', stats={}, counters={}, tableCounters={}, fileMap={}, progress=0, failMsg='', dppResult='null'}, progress=10
0, loadStartTimestamp=1776822003849, finishTimestamp=1776822024902, jobState=FINISHED, failMsg=null}}, msg={replay end load job}                              
2026-04-22 11:32:58.709+08:00 INFO (thrift-server-pool-242037|498216) [DatabaseTransactionMgr.beginTransaction():189] begin transaction: txn_id: 2200384 with 
label a7ab2459-d07e-4804-abc9-e4dfd66c2da9 from coordinator FE: fe1.cluster.local, listner id: 5185165, combinedTxnLog: false                                 
2026-04-22 11:32:58.726+08:00 INFO (thrift-server-pool-242037|498216) [StreamLoadTask.beginTxn():317] stream load a7ab2459-d07e-4804-abc9-e4dfd66c2da9 channel
_id 0 begin. db: DSDATAPROD, tbl: SFTCUC_T, txn_id: 2200384                                                                                                   
2026-04-22 11:33:31.743+08:00 INFO (thrift-server-pool-242038|498217) [DatabaseTransactionMgr.abortTransaction():571] transaction:[TransactionState. txn_id: 2
200384, label: a7ab2459-d07e-4804-abc9-e4dfd66c2da9, db id: 123029, table id list: 234849, callback id: 5185165, coordinator: FE: fe1.cluster.local, transacti
on status: ABORTED, error replicas num: 0, unknown replicas num: 0, prepare time: 1776828778709, write end time: -1, allow commit time: -1, commit time: -1, f
inish time: 1776828811739, total cost: 33030ms, reason: [E1008]Reached timeout=30000ms @192.168.0.191:18060, attachment: com.starrocks.load.loadv2.ManualLoadT
xnCommitAttachment@2f028a3b, partition commit info:[]] successfully rollback                 


[root@be1 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384                                                                                          

I20260422 11:32:59.703649 140121601525504 local_tablets_channel.cpp:756] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 open 9 delta writer: [493797:2][493793:1][493785:2][493777:2][493773:1][493765:2][493757:2][493805:2][493753:1]0 failed_tablets:  _num_remaining_sende
rs: 1                                                                                                                                                         
I20260422 11:33:01.734839 140121643489024 local_tablets_channel.cpp:614] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 commit 3 tablets: 493793,493773,493753                                                                                                                
I20260422 11:33:37.610197 140118900524800 agent_server.cpp:483] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=2    
I20260422 11:33:37.614201 139953103173376 agent_task.cpp:300] get clear transaction task task, signature:2200384, txn_id: 2200384, partition id size: 0       
I20260422 11:33:37.614206 139953103173376 storage_engine.cpp:683] Clearing transaction task txn_id: 2200384                                                   
I20260422 11:33:37.614556 139953103173376 storage_engine.cpp:704] Cleared transaction task txn_id: 2200384                                                    
I20260422 11:33:37.614562 139953103173376 agent_task.cpp:322] finish to clear transaction task. signature:2200384, txn_id: 2200384                            
I20260422 11:33:37.617108 139953103173376 agent_task.cpp:158] Remove task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=1      
[root@be1 ~]#                                                                                                                         

[root@be2 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384                                                                                          

I20260422 11:32:58.735701 139673884751616 local_tablets_channel.cpp:756] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 open 10 delta writer: [493809:1][493793:2][493789:1][493781:2][493773:2][493769:1][493761:2][493753:2][493801:2][493749:1]0 failed_tablets:  _num_rema
ining_senders: 1                                                                                                                                              
I20260422 11:33:01.731886 139674310678272 local_tablets_channel.cpp:614] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 commit 4 tablets: 493809,493789,493769,493749                                                                                                         
W20260422 11:33:33.019997 139471364216576 async_delta_writer.cpp:67] Fail to write or commit. txn_id: 2200384 tablet_id: 493769: Cancelled: cancel            
W20260422 11:33:34.188217 139678793541376 segment_replicate_executor.cpp:167] Failed to send rpc to SyncChannnel [host: 192.168.0.193, port: 18060, load_id: f
a4b4a00-8f42-7a9c-6421-4c73be233cb3, tablet_id: 493749, txn_id: 2200384] err=Internal error: no associated load channel fa4b4a00-8f42-7a9c-6421-4c73be233cb3  
W20260422 11:33:34.188269 139678793541376 segment_replicate_executor.cpp:306] Failed to sync segment SyncChannnel [host: 192.168.0.193, port: 18060, load_id: 
fa4b4a00-8f42-7a9c-6421-4c73be233cb3, tablet_id: 493749, txn_id: 2200384] err Internal error: no associated load channel fa4b4a00-8f42-7a9c-6421-4c73be233cb3 
W20260422 11:33:34.188363 139517354809088 async_delta_writer.cpp:67] Fail to write or commit. txn_id: 2200384 tablet_id: 493749: Cancelled: cancel            
I20260422 11:33:37.641174 139670396528384 agent_server.cpp:483] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=2    
I20260422 11:33:37.660721 139472807106304 agent_task.cpp:300] get clear transaction task task, signature:2200384, txn_id: 2200384, partition id size: 0       
I20260422 11:33:37.660733 139472807106304 storage_engine.cpp:683] Clearing transaction task txn_id: 2200384                                                   
I20260422 11:33:37.661264 139472807106304 storage_engine.cpp:704] Cleared transaction task txn_id: 2200384                                                    
I20260422 11:33:37.661279 139472807106304 agent_task.cpp:322] finish to clear transaction task. signature:2200384, txn_id: 2200384                            
I20260422 11:33:37.664354 139472807106304 agent_task.cpp:158] Remove task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=1      
I20260422 11:33:55.984027 139673876358912 local_tablets_channel.cpp:780] cancel LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a008f427a9c-64214c73be233cb3
 index_id: 234850 #tablet:10 tablet_ids:493809,493793,493789,493781,493773,493769,493761,493753,493801,493749          


[root@be3 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384                                                                                          

I20260422 11:32:58.736470 139735635113728 local_tablets_channel.cpp:756] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 open 9 delta writer: [493801:1][493793:2][493785:2][493781:1][493773:2][493765:2][493761:1][493805:2][493753:2]0 failed_tablets:  _num_remaining_sende
rs: 1                                                                                                                                                         
I20260422 11:33:01.732773 139735689139968 local_tablets_channel.cpp:614] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 commit 3 tablets: 493801,493781,493761                                                                                                                
I20260422 11:33:37.643101 139732626073344 agent_server.cpp:483] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=2    
I20260422 11:33:37.646571 139620192868096 agent_task.cpp:300] get clear transaction task task, signature:2200384, txn_id: 2200384, partition id size: 0       
I20260422 11:33:37.646578 139620192868096 storage_engine.cpp:683] Clearing transaction task txn_id: 2200384                                                   
I20260422 11:33:37.647465 139620192868096 storage_engine.cpp:704] Cleared transaction task txn_id: 2200384                                                    
I20260422 11:33:37.647477 139620192868096 agent_task.cpp:322] finish to clear transaction task. signature:2200384, txn_id: 2200384                            
I20260422 11:33:37.647788 139620192868096 agent_task.cpp:158] Remove task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=1      
[root@be3 ~]#                                                                                                              


[root@be4 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384                                                                                          

I20260422 11:33:01.729009 139923997382400 local_tablets_channel.cpp:756] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 open 10 delta writer: [493805:1][493797:2][493789:2][493785:1][493777:2][493769:2][493765:1][493809:2][493757:2][493749:2]0 failed_tablets:  _num_rema
ining_senders: 1                                                                                                                                              
I20260422 11:33:03.006169 139924005775104 local_tablets_channel.cpp:614] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 commit 3 tablets: 493805,493785,493765                                                                                                                
I20260422 11:33:33.016984 139923938633472 local_tablets_channel.cpp:344] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 wai
t tablet 493769 secondary replica finish timeout 30000ms still in state 1, primary replica: 192.168.0.191                                                     
I20260422 11:33:37.606293 139920597305088 agent_server.cpp:483] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=2    
I20260422 11:33:37.609707 139807894206208 agent_task.cpp:300] get clear transaction task task, signature:2200384, txn_id: 2200384, partition id size: 0       
I20260422 11:33:37.609712 139807894206208 storage_engine.cpp:683] Clearing transaction task txn_id: 2200384                                                   
I20260422 11:33:37.609978 139807894206208 storage_engine.cpp:704] Cleared transaction task txn_id: 2200384                                                    
I20260422 11:33:37.609983 139807894206208 agent_task.cpp:322] finish to clear transaction task. signature:2200384, txn_id: 2200384                            
I20260422 11:33:37.611705 139807894206208 agent_task.cpp:158] Remove task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=1  


[root@be5 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384                                                                                          

I20260422 11:32:58.725921 140223588755200 stream_load_executor.cpp:77] begin to execute job. label=a7ab2459-d07e-4804-abc9-e4dfd66c2da9, txn_id: 2200384, quer
y_id=fa4b4a00-8f42-7a9c-6421-4c73be233cb3                                                                                                                     
I20260422 11:32:58.727577 140224023148288 local_tablets_channel.cpp:756] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 open 10 delta writer: [493797:1][493789:2][493781:2][493777:1][493769:2][493761:2][493809:2][493757:1][493801:2][493749:2]0 failed_tablets:  _num_rema
ining_senders: 1                                                                                                                                              
I20260422 11:33:01.723709 140224023148288 local_tablets_channel.cpp:614] LocalTabletsChannel txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 sin
k_id: 0 commit 3 tablets: 493797,493777,493757                                                                                                                
I20260422 11:33:03.035321 140228019885824 tablet_sink_index_channel.cpp:861] OlapTableSink txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 commi
t 9 tablets: 493761,493781,493801                                                                                                                             
I20260422 11:33:14.453878 140228019885824 tablet_sink_index_channel.cpp:861] OlapTableSink txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 commi
t 9 tablets: 493753,493773,493793                                                                                                                             
I20260422 11:33:28.386667 140228019885824 tablet_sink_index_channel.cpp:861] OlapTableSink txn_id: 2200384 load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3 commi
t 9 tablets: 493757,493777,493797                                                                                                                             
W20260422 11:33:31.728569 140228019885824 tablet_sink_sender.cpp:250] close channel failed. channel_name=NodeChannel[10306], load_info=load_id=fa4b4a00-8f42-7
a9c-6421-4c73be233cb3, txn_id: 2200384, parallel=1, compress_type=2, error_msg=[E1008]Reached timeout=30000ms @192.168.0.193:18060                            
W20260422 11:33:31.728672 140228019885824 tablet_sink_sender.cpp:250] close channel failed. channel_name=NodeChannel[10003], load_info=load_id=fa4b4a00-8f42-7
a9c-6421-4c73be233cb3, txn_id: 2200384, parallel=1, compress_type=2, error_msg=[E1008]Reached timeout=30000ms @192.168.0.191:18060                            
I20260422 11:33:31.728828 140228019885824 tablet_sink_sender.cpp:353] Olap table sink statistics. load_id: fa4b4a00-8f42-7a9c-6421-4c73be233cb3, txn_id: 22003
84, add chunk time(ms)/wait lock time(ms)/num: {10003:(0)(0)(0)} {10307:(26661)(0)(1)} {10306:(0)(0)(0)} {10002:(12727)(0)(1)} {10004:(1310)(0)(1)}           
     - TxnID: 2200384                                                                                                                                         
W20260422 11:33:31.731943 140228019885824 stream_load_executor.cpp:112] fragment execute failed, query_id=fa4b4a008f427a9c-64214c73be233cb3, err_msg=[E1008]Re
ached timeout=30000ms @192.168.0.191:18060, id=fa4b4a008f427a9c-64214c73be233cb3, job_id=-1, txn_id: 2200384, label=a7ab2459-d07e-4804-abc9-e4dfd66c2da9, db=D
SDATAPROD                                                                                                                                                     
W20260422 11:33:31.732021 140223588755200 stream_load.cpp:160] Fail to handle streaming load, id=fa4b4a008f427a9c-64214c73be233cb3 errmsg=[E1008]Reached timeo
ut=30000ms @192.168.0.191:18060 id=fa4b4a008f427a9c-64214c73be233cb3, job_id=-1, txn_id: 2200384, label=a7ab2459-d07e-4804-abc9-e4dfd66c2da9, db=DSDATAPROD   
I20260422 11:33:37.601450 140153462949632 agent_server.cpp:483] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=2    
I20260422 11:33:37.605191 140103785621248 agent_task.cpp:300] get clear transaction task task, signature:2200384, txn_id: 2200384, partition id size: 0       
I20260422 11:33:37.605197 140103785621248 storage_engine.cpp:683] Clearing transaction task txn_id: 2200384                                                   
I20260422 11:33:37.605709 140103785621248 storage_engine.cpp:704] Cleared transaction task txn_id: 2200384                                                    
I20260422 11:33:37.605723 140103785621248 agent_task.cpp:322] finish to clear transaction task. signature:2200384, txn_id: 2200384                            
I20260422 11:33:37.605946 140103785621248 agent_task.cpp:158] Remove task success. type=CLEAR_TRANSACTION_TASK, signature=2200384, task_count_in_queue=1      
[root@be5 ~]#
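
The log excerpts above were copied from a terminal, so long records are wrapped onto several lines, sometimes in the middle of a word. Before searching or sorting them, it can help to rejoin each record. The following is a minimal Python sketch under stated assumptions: a record starts with a glog-style prefix (severity letter, date, time with microseconds), every other non-empty line is a continuation, and shell prompt lines such as `[root@be4 ~]# ...` are skipped. The function name `unwrap_glog` is hypothetical.

```python
import re

# glog-style record start, e.g. "I20260422 11:33:33.016984 "
GLOG_PREFIX = re.compile(r"^[IWEF]\d{8} \d{2}:\d{2}:\d{2}\.\d{6} ")


def unwrap_glog(lines):
    """Rejoin terminal-wrapped log records into one string per record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and pasted shell prompts.
        if not line.strip() or line.lstrip().startswith("[root@"):
            continue
        if GLOG_PREFIX.match(line):
            records.append(line)
        elif records:
            # Continuation: the wrap may split a word, so join without a separator.
            records[-1] += line
    # Only the padding at the end of each full record is removed.
    return [r.rstrip() for r in records]


if __name__ == "__main__":
    sample = [
        "[root@be4 ~]# cat /mnt/data1/StarRocks/be/log/be.INFO | grep 2200384\n",
        "I20260422 11:33:33.016984 139923938633472 local_tablets_channel.cpp:344] txn_id: 2200384 wai\n",
        "t tablet 493769 secondary replica finish timeout 30000ms     \n",
    ]
    for record in unwrap_glog(sample):
        print(record)
```

Joining without a separator is a deliberate choice here, because the wrap happens at a fixed column rather than at word boundaries.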

Here is a brief analysis of the timeline:

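11:32:58  be5 (entry BE) begins the transaction
11:32:58  be2 (.191), be3 (.192), be5 (.195) open delta writer
11:32:59  be1 (.190) open delta writer
11:33:01  be1 (.190), be2 (.191), be3 (.192), be5 (.195) commit locally (about 3 s, normal)
11:33:01  be4 (.194) open delta writer  ← Note: 3 seconds later than the other BEs!
11:33:03  be4 (.194) commits locally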

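11:33:03  be5 logs the commit response for tablets 493761,493781,493801, the tablets be3 committed locally (about 1.3 s after be5's own local commit, fast)
11:33:14  be5 logs the commit response for tablets 493753,493773,493793, the tablets be1 committed locally (about 12.7 s, slow)
11:33:28  be5 logs the commit for tablets 493757,493777,493797, the tablets be5 itself committed locally (about 26.7 s, very slow)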

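11:33:31  be5 times out; neither NodeChannel[10306] (.193) nor NodeChannel[10003] (.191) has returned a response
11:33:33  be4 reports: wait tablet 493769 secondary replica timeout
          the primary replica of tablet 493769 is on .191, and the secondary replica never gets a response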
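
One way to cross-check this timeline is to compare the tablet IDs in each `OlapTableSink ... commit 9 tablets` line on be5 with the tablets each BE committed locally. The sketch below is a minimal Python illustration with the values hand-copied from the logs above. It assumes, without confirmation from the source, that each such line on be5 corresponds to the response of exactly one BE; the function name `match_acks` is hypothetical.

```python
# Tablets each BE committed locally ("commit N tablets" in local_tablets_channel.cpp:614).
local_commits = {
    "be1": {493793, 493773, 493753},
    "be2": {493809, 493789, 493769, 493749},
    "be3": {493801, 493781, 493761},
    "be4": {493805, 493785, 493765},
    "be5": {493797, 493777, 493757},
}

# Commit lines logged by the entry BE be5 (tablet_sink_index_channel.cpp:861).
coordinator_acks = [
    ("11:33:03.035321", {493761, 493781, 493801}),
    ("11:33:14.453878", {493753, 493773, 493793}),
    ("11:33:28.386667", {493757, 493777, 493797}),
]


def match_acks(commits, acks):
    """Return ({be: timestamp} for matched responses, sorted list of BEs with none)."""
    acked = {}
    for ts, tablets in acks:
        for be, committed in commits.items():
            # Assumption: a response lists exactly the tablets that BE committed.
            if committed == tablets:
                acked[be] = ts
    missing = sorted(set(commits) - set(acked))
    return acked, missing


acked, missing = match_acks(local_commits, coordinator_acks)
print("acknowledged:", acked)
print("no response before the timeout:", missing)
```

With these inputs the three responses match be3, be1, and be5, and nothing matches be2 or be4, which is in line with the two NodeChannels that timed out on be5.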

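It looks as if, during the write, the primary replica timed out while waiting for a response from the secondary replica, but I do not know how to investigate further why that timeout happened.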
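
To see which BEs are involved for a stuck tablet, the `[tablet_id:N]` pairs in the "open ... delta writer" lines can be turned into a per-tablet role map. This is a minimal Python sketch with the pairs hand-copied from the five BE logs above. It assumes that `1` marks the primary replica and `2` a secondary replica; the post does not state this, although it is consistent with be4's log naming 192.168.0.191 (be2 in the timeline) as the primary of tablet 493769. The function name `replica_roles` is hypothetical.

```python
import re

PAIR = re.compile(r"\[(\d+):(\d+)\]")

# "[tablet_id:state]" lists from each BE's "open N delta writer" line.
open_lines = {
    "be1": "[493797:2][493793:1][493785:2][493777:2][493773:1][493765:2][493757:2][493805:2][493753:1]",
    "be2": "[493809:1][493793:2][493789:1][493781:2][493773:2][493769:1][493761:2][493753:2][493801:2][493749:1]",
    "be3": "[493801:1][493793:2][493785:2][493781:1][493773:2][493765:2][493761:1][493805:2][493753:2]",
    "be4": "[493805:1][493797:2][493789:2][493785:1][493777:2][493769:2][493765:1][493809:2][493757:2][493749:2]",
    "be5": "[493797:1][493789:2][493781:2][493777:1][493769:2][493761:2][493809:2][493757:1][493801:2][493749:2]",
}


def replica_roles(lines_by_be):
    """Map tablet_id -> {"primary": [BEs], "secondary": [BEs]}."""
    roles = {}
    for be, line in lines_by_be.items():
        for tablet, state in PAIR.findall(line):
            entry = roles.setdefault(int(tablet), {"primary": [], "secondary": []})
            # Assumption: state 1 = primary replica, state 2 = secondary replica.
            if state == "1":
                entry["primary"].append(be)
            elif state == "2":
                entry["secondary"].append(be)
    return roles


roles = replica_roles(open_lines)
for tablet in (493769, 493749):  # the two tablets named in the warnings
    print(tablet, roles[tablet])
```

Under that assumption, both tablets that appear in the warnings (493769 and 493749) have their primary on be2 and their secondaries on be4 and be5, so the be2 to be4/be5 replication path may be the place to look first.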

After reading community posts, I set flush_thread_num_per_store = 4, but there was no noticeable improvement.