【Description】Stream Load jobs fail with a BE timeout error: close channel failed. channel_name=NodeChannel[11001], load_info=load_id=764418c3-556b-77fd-caa9-6a93ff917384, txn_id: 29357366, parallel=1, compress_type=2, error_msg=[E1008]Reached timeout=60000ms @10.3.138.104:8060
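【Background】The table being loaded into has 7794 tablets; the whole cluster has 59615 tablets in total.
【Business impact】Load jobs fail.
【StarRocks version】2.5.3
【Cluster size】3 FE + 3 BE, co-located on the same machines
【Machine info】64 cores / 256 GB memory / 10 Gigabit network
【Contact】Community group 13, 唐老鸭
【Attachments】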
fe.log=======================================================================
2023-06-27 00:46:13,074 INFO (PUBLISH_VERSION|35) [PublishVersionDaemon.publishVersionForOlapTable():125] send publish tasks for txn_id: 29357645
2023-06-27 00:46:13,074 INFO (thrift-server-pool-55636275|56276796) [DatabaseTransactionMgr.commitTransaction():479] transaction:[TransactionState. txn_id: 29357645, label: 227e932f-4cdf-483b-9c6d-a079fd3a0b659ac82af6-50a9-4b5b-b6f7-6255510ef88c, db id: 12306, table id list: 832620, callback id: -1, coordinator: BE: 10.3.138.104, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1687797971657, commit time: 1687797973072, finish time: -1, write cost: 1415ms, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@2b9063ab] successfully committed
2023-06-27 00:46:13,304 WARN (Connect-Scheduler-Check-Timer-0|28) [ConnectContext.checkTimeout():604] kill wait timeout connection, remote: 172.16.41.99:64523, wait timeout: 28800
2023-06-27 00:46:13,304 WARN (Connect-Scheduler-Check-Timer-0|28) [ConnectContext.kill():561] kill query, 172.16.41.99:64523, kill connection: true
2023-06-27 00:46:13,337 WARN (thrift-server-pool-55637434|56277994) [LeaderImpl.finishTask():191] cannot find task. type: PUBLISH_VERSION, backendId: 11072, signature: 29357613
2023-06-27 00:46:13,489 INFO (thrift-server-pool-55636487|56277029) [FrontendServiceImpl.loadTxnCommit():960] receive txn commit request. db: ODS, tbl: D_API_ORDER, txn_id: 29357366, backend: 10.3.138.104
2023-06-27 00:46:13,497 WARN (thrift-server-pool-55636487|56277029) [OlapTableTxnStateListener.preCommit():192] Commit failed. txn: 29357366 table: D_API_ORDER tablet: 2965598 quorum: 1<2 errorReplicas: 2965601:{be:11072 10.3.138.106 V:51424 LFV:51425},2965600:{be:11001 10.3.138.104 V:51427 LFV:-1},
2023-06-27 00:46:13,497 WARN (thrift-server-pool-55636487|56277029) [FrontendServiceImpl.loadTxnCommit():978] failed to commit txn_id: 29357366: Commit failed. txn: 29357366 table: D_API_ORDER tablet: 2965598 quorum: 1<2 errorReplicas: 2965601:{be:11072 10.3.138.106 V:51424 LFV:51425},2965600:{be:11001 10.3.138.104 V:51427 LFV:-1},
be.log=====================================================================
W0627 00:46:11.067620 160235 utils.cpp:56] master client, retry finishTask: No more data to read.
W0627 00:46:11.068720 160235 utils.cpp:56] master client, retry finishTask: No more data to read.
W0627 00:46:11.753695 160266 utils.cpp:96] master client, retry finishTask: No more data to read.
W0627 00:46:12.706791 160252 utils.cpp:56] master client, retry finishTask: write() send(): Broken pipe
W0627 00:46:12.729686 160252 utils.cpp:56] master client, retry finishTask: No more data to read.
W0627 00:46:12.756657 160266 utils.cpp:96] master client, retry finishTask: No more data to read.
W0627 00:46:13.083334 160265 utils.cpp:96] master client, retry finishTask: No more data to read.
W0627 00:46:13.487699 158780 tablet_sink.cpp:1456] close channel failed. channel_name=NodeChannel[11001], load_info=load_id=764418c3-556b-77fd-caa9-6a93ff917384, txn_id: 29357366, parallel=1, compress_type=2, error_msg=[E1008]Reached timeout=60000ms @10.3.138.104:8060
W0627 00:46:13.497854 160528 stream_load_executor.cpp:210] commit transaction failed, errmsg=Commit failed. txn: 29357366 table: D_API_ORDER tablet: 2965598 quorum: 1<2 errorReplicas: 2965601:{be:11072 10.3.138.106 V:51424 LFV:51425},2965600:{be:11001 10.3.138.104 V:51427 LFV:-1},id=764418c3556b77fd-caa96a93ff917384, job_id=-1, txn_id: 29357366, label=66143edf-ad8b-42d7-a702-e81deeae648420c6aeb9-3aae-41da-8c34-f574991e3f22, db=ODS
W0627 00:46:13.497944 160528 stream_load.cpp:135] Fail to handle streaming load, id=764418c3556b77fd-caa96a93ff917384 errmsg=Commit failed. txn: 29357366 table: D_API_ORDER tablet: 2965598 quorum: 1<2 errorReplicas: 2965601:{be:11072 10.3.138.106 V:51424 LFV:51425},2965600:{be:11001 10.3.138.104 V:51427 LFV:-1},
W0627 00:46:13.648448 86546 agent_server.cpp:462] fail to make_snapshot. tablet_id:2298305 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:2298305 #version:70 [53702.1 53765@69 53765] #pending:0 request_version:53767,
gc.log========================================================================
2023-06-27T00:45:55.282+0800: 6505181.141: [GC (Allocation Failure) 2023-06-27T00:45:55.282+0800: 6505181.141: [ParNew: 570177K->16866K(629120K), 0.0195917 secs] 2218686K->1665375K(5142788K), 0.0198693 secs] [Times: user=0.51 sys=0.05, real=0.02 secs]
2023-06-27T00:46:02.435+0800: 6505188.294: [GC (Allocation Failure) 2023-06-27T00:46:02.435+0800: 6505188.294: [ParNew: 576098K->69560K(629120K), 0.0219149 secs] 2224607K->1726273K(5142788K), 0.0221961 secs] [Times: user=0.61 sys=0.12, real=0.02 secs]
2023-06-27T00:46:09.408+0800: 6505195.267: [GC (Allocation Failure) 2023-06-27T00:46:09.408+0800: 6505195.267: [ParNew: 622831K->13677K(629120K), 0.0274289 secs] 2279544K->1677163K(5142788K), 0.0277921 secs] [Times: user=0.34 sys=0.19, real=0.03 secs]
2023-06-27T00:46:16.689+0800: 6505202.548: [GC (Allocation Failure) 2023-06-27T00:46:16.689+0800: 6505202.548: [ParNew: 571847K->2062K(629120K), 0.0071973 secs] 2235333K->1665548K(5142788K), 0.0075146 secs] [Times: user=0.15 sys=0.02, real=0.01 secs]
2023-06-27T00:46:23.636+0800: 6505209.495: [GC (Allocation Failure) 2023-06-27T00:46:23.636+0800: 6505209.496: [ParNew: 560161K->1218K(629120K), 0.0127085 secs] 2223648K->1664704K(5142788K), 0.0130776 secs] [Times: user=0.16 sys=0.01, real=0.01 secs]
2023-06-27T00:46:26.922+0800: 6505212.781: [GC (Allocation Failure) 2023-06-27T00:46:26.922+0800: 6505212.781: [ParNew: 559140K->69724K(629120K), 0.0164322 secs] 2222626K->1740537K(5142788K), 0.0166766 secs] [Times: user=0.58 sys=0.02, real=0.01 secs]
2023-06-27T00:46:29.064+0800: 6505214.924: [GC (Allocation Failure) 2023-06-27T00:46:29.065+0800: 6505214.924: [ParNew: 628097K->67494K(629120K), 0.0183927 secs] 2298910K->1746944K(5142788K), 0.0186806 secs] [Times: user=0.50 sys=0.06, real=0.02 secs]
2023-06-27T00:46:31.306+0800: 6505217.165: [GC (Allocation Failure) 2023-06-27T00:46:31.306+0800: 6505217.165: [ParNew: 626140K->67339K(629120K), 0.0249543 secs] 2305590K->1754038K(5142788K), 0.0251764 secs] [Times: user=0.81 sys=0.00, real=0.03 secs]
2023-06-27T00:46:33.422+0800: 6505219.281: [GC (Allocation Failure) 2023-06-27T00:46:33.422+0800: 6505219.281: [ParNew: 626358K->65251K(629120K), 0.0154773 secs] 2313056K->1757888K(5142788K), 0.0157377 secs] [Times: user=0.46 sys=0.01, real=0.01 secs]
2023-06-27T00:46:35.256+0800: 6505221.115: [GC (Allocation Failure) 2023-06-27T00:46:35.256+0800: 6505221.115: [ParNew: 624209K->67445K(629120K), 0.0120083 secs] 2316845K->1767354K(5142788K), 0.0122048 secs] [Times: user=0.20 sys=0.01, real=0.01 secs]
2023-06-27T00:46:37.017+0800: 6505222.876: [GC (Allocation Failure) 2023-06-27T00:46:37.017+0800: 6505222.876: [ParNew: 625894K->103K(629120K), 0.0110343 secs] 2325802K->1700238K(5142788K), 0.0112688 secs] [Times: user=0.14 sys=0.01, real=0.01 secs]
2023-06-27T00:46:39.129+0800: 6505224.988: [GC (Allocation Failure) 2023-06-27T00:46:39.129+0800: 6505224.989: [ParNew: 558116K->15174K(629120K), 0.0172835 secs] 2258250K->1715308K(5142788K), 0.0175364 secs] [Times: user=0.36 sys=0.02, real=0.02 secs]
2023-06-27T00:46:41.209+0800: 6505227.068: [GC (Allocation Failure) 2023-06-27T00:46:41.209+0800: 6505227.068: [ParNew: 573305K->745K(629120K), 0.0140908 secs] 2273440K->1700880K(5142788K), 0.0143566 secs] [Times: user=0.26 sys=0.17, real=0.02 secs]
2023-06-27T00:46:43.283+0800: 6505229.142: [GC (Allocation Failure) 2023-06-27T00:46:43.283+0800: 6505229.142: [ParNew: 559260K->65478K(629120K), 0.0233544 secs] 2259395K->1770955K(5142788K), 0.0236465 secs] [Times: user=0.70 sys=0.10, real=0.02 secs]
2023-06-27T00:46:45.377+0800: 6505231.236: [GC (Allocation Failure) 2023-06-27T00:46:45.377+0800: 6505231.236: [ParNew: 623179K->63816K(629120K), 0.0186363 secs] 2328656K->1776655K(5142788K), 0.0188726 secs] [Times: user=0.51 sys=0.12, real=0.02 secs]
2023-06-27T00:46:47.462+0800: 6505233.322: [GC (Allocation Failure) 2023-06-27T00:46:47.463+0800: 6505233.322: [ParNew: 621925K->67453K(629120K), 0.0143250 secs] 2334764K->1787077K(5142788K), 0.0145633 secs] [Times: user=0.46 sys=0.04, real=0.01 secs]
Related screenshots follow================================================================

