stream load Commit failed

【Details】Importing data into StarRocks via stream load
【Business impact】
【StarRocks version】2.4.0
【Cluster size】3 FE (1 follower + 2 observers) + 6 BE (FE and BE deployed on separate machines)
【Machine specs】CPU (vCores)/memory/NIC: 16C/64G/1 GbE
【Table model】Primary Key model
【Load or export method】JDBC stream load
【Contact】1374507895@qq.com
【Attachments】
Commit failed. txn: 344447 table: cs1ysj tablet: 1076897 quorum: 1<2 errorReplicas: 1076900:{be:11745 10.170.0.43 V:4557 LFV:4558},1076898:{be:10250 10.170.0.45 V:4559 LFV:4559}

Please post the detailed error and screenshots; this information is a bit thin.

StatusCode: 200, result: {
  "TxnId": 344447,
  "Label": "653134d4-f703-491e-bfdf-2a175a8b7d87",
  "Status": "Fail",
  "Message": "Commit failed. txn: 344447 table: cs1ysj tablet: 1076897 quorum: 1<2 errorReplicas: 1076900:{be:11745 10.170.0.43 V:4557 LFV:4558},1076898:{be:10250 10.170.0.45 V:4559 LFV:4559},",
  "NumberTotalRows": 500000,
  "NumberLoadedRows": 500000,
  "NumberFilteredRows": 0,
  "NumberUnselectedRows": 0,
  "LoadBytes": 52810200,
  "LoadTimeMs": 10770,
  "BeginTxnTimeMs": 1,
  "StreamLoadPlanTimeMs": 2,
  "ReadDataTimeMs": 1492,
  "WriteDataTimeMs": 10762,
  "CommitAndPublishTimeMs": 0
}
This is what stream load returned directly. There is no specific error URL, and the BE logs contain no corresponding errors either.

OK, here is how to dig in. First, look up this load's load_id and the IP of the BE node it was scheduled to: grep -w $TxnId fe.log | grep "load id"

  1. Sample output: 2023-1-30 20:48:50,169 INFO (thrift-server-pool-4|138) [FrontendServiceImpl.streamLoadPut():809] receive stream load put request. db:ssb, tbl: demo_test_1, txn id: 1580717, load id: 7a4d4384-1ad7-b798-f176-4ae9d7ea6b9d, backend: 172.26.92.155

  2. This shows the IP of the BE node the load ran on. Go to that node and look for the specific cause: grep $load_id be.INFO | less

  3. Sample output (the detailed error will show up here):
     I0518 11:58:16.771597 4228 stream_load.cpp:202] new income streaming load request.id=f1481, job_id=-1, txn_id=-1, label=metrics_detail_16, db=starrocks, tbl=metrics_detail
     I0518 11:58:16.7 4176 load_channel_mgr.cpp:186] Removing finished load channel load id=f181
     I0518 11:58:16.7 4176 load_channel.cpp:40] load channel mem peak usage=1915984, info=limit: 16113540169; label: f181; all tracker size: 3; limit trackers size: 3; parent is null: false; , load_id=f181

  4. If that still isn't conclusive, follow the thread context to trace further, e.g. thread 4176 from the sample above: grep -w 4176 be.INFO | less
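The steps above can be sketched as a small script. This is a self-contained illustration that runs the greps against the sample FE log line quoted above (written to a temporary file); on a real cluster you would grep the actual fe.log on the FE and be.INFO on the BE node instead. The file path /tmp/fe.log.sample is made up for the demo.

```shell
# Sample FE log line from the thread, standing in for the real fe.log
TXN_ID=1580717
cat > /tmp/fe.log.sample <<'EOF'
2023-1-30 20:48:50,169 INFO (thrift-server-pool-4|138) [FrontendServiceImpl.streamLoadPut():809] receive stream load put request. db:ssb, tbl: demo_test_1, txn id: 1580717, load id: 7a4d4384-1ad7-b798-f176-4ae9d7ea6b9d, backend: 172.26.92.155
EOF

# Step 1: locate the "load id" line for this transaction
LINE=$(grep -w "$TXN_ID" /tmp/fe.log.sample | grep "load id")

# Extract the load_id and the BE node IP for the next steps
LOAD_ID=$(echo "$LINE" | sed 's/.*load id: \([^,]*\),.*/\1/')
BE_IP=$(echo "$LINE" | sed 's/.*backend: //')
echo "load_id=$LOAD_ID be=$BE_IP"

# Step 2 (run on the BE node $BE_IP): grep "$LOAD_ID" be.INFO | less
# Step 3 (if needed, by thread id):  grep -w <thread_id> be.INFO | less
```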

W0531 14:40:06.789105 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.794270 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.799448 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.804625 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.810240 1089 tablet_sink.cpp:1041] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
I0531 14:40:06.811036 1089 tablet_sink.cpp:1082] Olap table sink statistics. load_id: 4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, add chunk time(ms)/wait lock time(ms)/num: {10242:(536)(0)(49)} {10251:(1335)(0)(49)} {10238:(629)(0)(49)} {11745:(0)(0)(0)} {10250:(404)(0)(51)} {10246:(3439)(0)(51)}
Memory was exhausted: we loaded 600 million rows into the Primary Key table. After setting "enable_persistent_index" = "true", memory usage came down, but with multiple concurrent jobs we now hit "get database write lock timeout". It looks like adding memory is the only option.
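For reference, a minimal sketch of how the property was enabled, assuming the table name cs1ysj from the commit error above and the default FE query port 9030; the host placeholder is not from the thread. Persisting the primary-key index moves it from memory to disk, trading memory pressure for extra disk I/O.

```shell
# Persist the primary-key index for the table (run via the MySQL client
# against an FE node; <fe_host> is a placeholder).
mysql -h <fe_host> -P 9030 -uroot -e \
  'ALTER TABLE cs1ysj SET ("enable_persistent_index" = "true");'
```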

Is the current version 2.4.0?

Are the FE metadata and the BE data directories on the same disk?

2.4.0 c0fa2bb

No, FE and BE are deployed separately.

You could try upgrading to the latest 2.5 release or the latest 2.4 release. We have made some optimizations around locking, but I'm not sure whether they cover your particular scenario.

Was disk I/O high at the time, after the persistent index was enabled?

We'd suggest upgrading to the latest 2.4 patch release first. Also check the FE log for "slow db lock" entries to see how long the lock wait was. If that doesn't help, increase the BE parameter txn_commit_rpc_timeout_ms a bit; the current default is 20s. (This parameter change only takes effect after upgrading to the latest 2.4 release.)
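A sketch of both checks, assuming the default BE HTTP port 8040 and a 60s target timeout (both are illustrative choices, not values from the thread); the host placeholder must be filled in:

```shell
# How long did the FE wait on the db lock?
grep "slow db lock" fe.log

# Raise the BE-side commit RPC timeout from the 20s default to 60s at
# runtime via the BE HTTP config endpoint (dynamic, not persisted across
# restarts; add it to be.conf to make it permanent).
curl -XPOST "http://<be_host>:8040/api/update_config?txn_commit_rpc_timeout_ms=60000"
```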

The lock timeout was 15s. There are requirements on load speed; a wait that long is not acceptable to the customer.

Are the jobs before and after enabling the persistent index the same?
Does the lock timeout error occur only with enable_persistent_index=true? With the persistent index disabled, were the load concurrency and data volume the same?

The jobs are the same, and the data volume is identical.