stream load Commit failed

【Details】Importing data into StarRocks via stream load
【Business impact】
【StarRocks version】2.4.0
【Cluster size】3 FE (1 follower + 2 observers) + 6 BE (FE and BE deployed on separate machines)
【Machine specs】CPU (vCores)/memory/NIC: 16C/64G/1 GbE
【Table model】Primary Key model
【Load or export method】JDBC stream load
【Contact】1374507895@qq.com
【Attachments】
Commit failed. txn: 344447 table: cs1ysj tablet: 1076897 quorum: 1<2 errorReplicas: 1076900:{be:11745 10.170.0.43 V:4557 LFV:4558},1076898:{be:10250 10.170.0.45 V:4559 LFV:4559}

Please post the detailed error and screenshots; this information is a bit thin.

StatusCode: 200, result: {
  "TxnId": 344447,
  "Label": "653134d4-f703-491e-bfdf-2a175a8b7d87",
  "Status": "Fail",
  "Message": "Commit failed. txn: 344447 table: cs1ysj tablet: 1076897 quorum: 1<2 errorReplicas: 1076900:{be:11745 10.170.0.43 V:4557 LFV:4558},1076898:{be:10250 10.170.0.45 V:4559 LFV:4559},",
  "NumberTotalRows": 500000,
  "NumberLoadedRows": 500000,
  "NumberFilteredRows": 0,
  "NumberUnselectedRows": 0,
  "LoadBytes": 52810200,
  "LoadTimeMs": 10770,
  "BeginTxnTimeMs": 1,
  "StreamLoadPlanTimeMs": 2,
  "ReadDataTimeMs": 1492,
  "WriteDataTimeMs": 10762,
  "CommitAndPublishTimeMs": 0
}
This is what stream load returned directly. There is no specific error URL, and the BE logs contain no corresponding errors either.

OK, here is how to dig in. First, look up this load's load_id and the IP of the BE node it was scheduled to: grep -w $TxnId fe.log | grep "load id"

  1. Sample output: 2023-1-30 20:48:50,169 INFO (thrift-server-pool-4|138) [FrontendServiceImpl.streamLoadPut():809] receive stream load put request. db:ssb, tbl: demo_test_1, txn id: 1580717, load id: 7a4d4384-1ad7-b798-f176-4ae9d7ea6b9d, backend: 172.26.92.155

  2. This shows the IP of the BE node the load ran on. Go to that node and look for the specific cause: grep $load_id be.INFO | less

  3. Sample output (the detailed error will show up here):
     I0518 11:58:16.771597 4228 stream_load.cpp:202] new income streaming load request.id=f1481, job_id=-1, txn_id=-1, label=metrics_detail_16, db=starrocks, tbl=metrics_detail
     I0518 11:58:16.7 4176 load_channel_mgr.cpp:186] Removing finished load channel load id=f181
     I0518 11:58:16.7 4176 load_channel.cpp:40] load channel mem peak usage=1915984, info=limit: 16113540169; label: f181; all tracker size: 3; limit trackers size: 3; parent is null: false; , load_id=f181

  4. If that still isn't conclusive, follow the thread context to trace further, e.g. thread 4176 from the sample above: grep -w 4176 be.INFO | less
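The steps above can be sketched as a small script. This is a self-contained illustration that runs the greps against the sample FE log line quoted above (written to a temporary file); on a real cluster you would grep the actual fe.log on the FE and be.INFO on the BE node instead. The file path /tmp/fe.log.sample is made up for the demo.

```shell
# Sample FE log line from the thread, standing in for the real fe.log
TXN_ID=1580717
cat > /tmp/fe.log.sample <<'EOF'
2023-1-30 20:48:50,169 INFO (thrift-server-pool-4|138) [FrontendServiceImpl.streamLoadPut():809] receive stream load put request. db:ssb, tbl: demo_test_1, txn id: 1580717, load id: 7a4d4384-1ad7-b798-f176-4ae9d7ea6b9d, backend: 172.26.92.155
EOF

# Step 1: locate the "load id" line for this transaction
LINE=$(grep -w "$TXN_ID" /tmp/fe.log.sample | grep "load id")

# Extract the load_id and the BE node IP for the next steps
LOAD_ID=$(echo "$LINE" | sed 's/.*load id: \([^,]*\),.*/\1/')
BE_IP=$(echo "$LINE" | sed 's/.*backend: //')
echo "load_id=$LOAD_ID be=$BE_IP"

# Step 2 (run on the BE node $BE_IP): grep "$LOAD_ID" be.INFO | less
# Step 3 (if needed, by thread id):  grep -w <thread_id> be.INFO | less
```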

W0531 14:40:06.789105 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.794270 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.799448 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.804625 1089 tablet_sink.cpp:973] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
W0531 14:40:06.810240 1089 tablet_sink.cpp:1041] close channel failed. channel_name=NodeChannel[1081040-11745], load_info=load_id=4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, parallel=1, compress_type=2, error_msg=Primary-key index exceeds the limit. tablet_id: 1081045, consumption: 25673469793, limit: 25471975882. Memory stats of top five tablets: 1081357(481M)1081229(481M)1081069(481M)1081165(481M)1081437(481M)
I0531 14:40:06.811036 1089 tablet_sink.cpp:1082] Olap table sink statistics. load_id: 4148e8b9-f685-2646-fa36-84f83d2f93b4, txn_id: 351609, add chunk time(ms)/wait lock time(ms)/num: {10242:(536)(0)(49)} {10251:(1335)(0)(49)} {10238:(629)(0)(49)} {11745:(0)(0)(0)} {10250:(404)(0)(51)} {10246:(3439)(0)(51)}
Memory was exhausted: we loaded 600 million rows into the Primary Key table. After setting "enable_persistent_index" = "true", memory usage came down, but with multiple concurrent jobs we now hit "get database write lock timeout". It looks like adding memory is the only option.
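For reference, a minimal sketch of how the property was enabled, assuming the table name cs1ysj from the commit error above and the default FE query port 9030; the host placeholder is not from the thread. Persisting the primary-key index moves it from memory to disk, trading memory pressure for extra disk I/O.

```shell
# Persist the primary-key index for the table (run via the MySQL client
# against an FE node; <fe_host> is a placeholder).
mysql -h <fe_host> -P 9030 -uroot -e \
  'ALTER TABLE cs1ysj SET ("enable_persistent_index" = "true");'
```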

Is the current version 2.4.0?

Are the FE metadata and the BE data directories on the same disk?

2.4.0 c0fa2bb

No, FE and BE are deployed separately.

You could try upgrading to the latest 2.5 release or the latest 2.4 release. We have made some optimizations around locking, but I'm not sure whether they cover your particular scenario.

Was disk I/O high at the time, after the persistent index was enabled?

We'd suggest upgrading to the latest 2.4 patch release first. Also check the FE log for "slow db lock" entries to see how long the lock wait was. If that doesn't help, increase the BE parameter txn_commit_rpc_timeout_ms a bit; the current default is 20s. (This parameter change only takes effect after upgrading to the latest 2.4 release.)
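A sketch of both checks, assuming the default BE HTTP port 8040 and a 60s target timeout (both are illustrative choices, not values from the thread); the host placeholder must be filled in:

```shell
# How long did the FE wait on the db lock?
grep "slow db lock" fe.log

# Raise the BE-side commit RPC timeout from the 20s default to 60s at
# runtime via the BE HTTP config endpoint (dynamic, not persisted across
# restarts; add it to be.conf to make it permanent).
curl -XPOST "http://<be_host>:8040/api/update_config?txn_commit_rpc_timeout_ms=60000"
```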

The lock timeout was 15s. There are requirements on load speed; a wait that long is not acceptable to the customer.

Are the jobs before and after enabling the persistent index the same?
Does the lock timeout error occur only with enable_persistent_index=true? With the persistent index disabled, were the load concurrency and data volume the same?

The jobs are the same, and the data volume is identical.