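【Details】The cluster has a dozen or so asynchronous materialized views that intermittently fail with timeout errors. The logs show that the transaction was rolled back.
【Background】The data volume of the base tables fluctuates; no operations were performed on the cluster.
【Business impact】Materialized view refresh is affected.
【StarRocks version】2.3.5
【Cluster size】3 FE (1 follower + 2 observers) + 8 BE (FE and BE deployed separately)
【Machine info】32 cores / 128 GB / 10 GbE NIC
【Contact】Community group 16, 一叶
【Attachments】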
- fe.log
20231114-1:2023-11-14 23:56:29,990 INFO (txnTimeoutChecker|73) [DatabaseTransactionMgr.abortTransaction():1263] transaction:[TransactionState. txn_id: 115950499, label: insert_acf904d3-8305-11ee-9dfb-0632b3434595, db id: 2215798, table id list: 18596555, callback id: -1, coordinator: FE: xxxx, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1699977089967, commit time: -1, finish time: 1699977389987, total cost: 300020ms, reason: timeout by txn manager] successfully rollback
- be.log
be.INFO.log.20231114-230750:I1114 23:51:29.972644 26866 local_tablets_channel.cpp:485] LocalTabletsChannel txn_id: 115950499 load_id: acf904d3-8305-11ee-9dfb-0632b3434595 open delta writer: [37398742:0][37398730:0][37398718:0][37398706:0][37398694:0][37398682:0] failed_tablets:
be.INFO.log.20231114-230750:I1114 23:51:38.846621 26186 tablet_sink.cpp:1448] Olap table sink statistics. load_id: acf904d3-8305-11ee-9dfb-0632b3434595, txn_id: 115950499, add chunk time(ms)/wait lock time(ms)/num: {10008:(27)(0)(13)} {36620921:(11)(0)(13)} {11077432:(5)(0)(13)} {1025798:(6)(0)(11)} {10004:(11)(0)(10)} {35054236:(17)(0)(11)} {33375062:(11)(0)(11)} {1025799:(17)(0)(10)} {33566614:(14)(0)(10)}
be.INFO.log.20231114-230750:I1114 23:51:38.974539 26933 local_tablets_channel.cpp:343] LocalTabletsChannel txn_id: 115950499 load_id: acf904d3-8305-11ee-9dfb-0632b3434595 commit 6 tablets: 37398742,37398730,37398718,37398706,37398694,37398682
be.INFO.log.20231114-230750:I1114 23:51:39.079710 25962 txn_manager.cpp:285] Commit txn successfully. tablet: 37398742, txn_id: 115950499, rowsetid: 02000001809cd3893146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:51:39.091668 25961 txn_manager.cpp:285] Commit txn successfully. tablet: 37398718, txn_id: 115950499, rowsetid: 02000001809cd3873146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:51:39.112010 25960 txn_manager.cpp:285] Commit txn successfully. tablet: 37398694, txn_id: 115950499, rowsetid: 02000001809cd3853146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:51:39.125519 25956 txn_manager.cpp:285] Commit txn successfully. tablet: 37398682, txn_id: 115950499, rowsetid: 02000001809cd3843146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:51:39.126968 27016 txn_manager.cpp:285] Commit txn successfully. tablet: 37398730, txn_id: 115950499, rowsetid: 02000001809cd3883146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:51:39.155534 27009 txn_manager.cpp:285] Commit txn successfully. tablet: 37398706, txn_id: 115950499, rowsetid: 02000001809cd3863146fb3bade1764bcefe682d3582cbb5 #segment:1 #delfile:0
be.INFO.log.20231114-230750:I1114 23:56:26.375923 19920 heartbeat_server.cpp:80] Updating master info: TMasterInfo(network_address=TNetworkAddress(hostname=xxxx, port=9020), cluster_id=1273598775, epoch=33, token=ffb80594-3c2b-4f1f-86c9-66185206230f, backend_ip=xxxxx, http_port=8030, heartbeat_flags=0, backend_id=10008, min_active_txn_id=115950499)
be.INFO.log.20231114-230750:I1115 00:00:09.669313 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398682.46029588.0246ba39d8b17228-9c0db0e9928cfeac, rowset: 02000001809cd3843146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:00:09.669340 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398694.46029588.be4d4df167e97932-199ecf39cc9353af, rowset: 02000001809cd3853146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:00:09.669358 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398706.46029588.f346e78cb0f02f44-76220c84c8834a8f, rowset: 02000001809cd3863146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:00:09.669373 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398718.46029588.7242db6c60987caa-89f7e8739debc398, rowset: 02000001809cd3873146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:00:09.669389 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398730.46029588.9a47f3a09a9c78a4-87646db87ce7aa94, rowset: 02000001809cd3883146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:00:09.669406 26877 txn_manager.cpp:527] remove tablet related txn. partition_id: 37398681, txn_id: 115950499, tablet: 37398742.46029588.23483d2514bd59e7-591c25db2cb7f9b3, rowset: 02000001809cd3893146fb3bade1764bcefe682d3582cbb5
be.INFO.log.20231114-230750:I1115 00:03:24.114102 21001 agent_server.cpp:381] Submit task success. type=CLEAR_TRANSACTION_TASK, signature=115950499
be.INFO.log.20231114-230750:I1115 00:03:24.114351 26831 agent_task.cpp:240] get clear transaction task task, signature:115950499, txn_id: 115950499, partition id size: 0
be.INFO.log.20231114-230750:I1115 00:03:24.114354 26831 storage_engine.cpp:574] Clearing transaction task txn_id: 115950499
be.INFO.log.20231114-230750:I1115 00:03:24.114356 26831 storage_engine.cpp:595] Cleared transaction task txn_id: 115950499
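be.INFO.log.20231114-230750:I1115 00:03:24.114356 26831 agent_task.cpp:257] finish to clear transaction task. signature:115950499, txn_id: 115950499
- Screenshot of BE node CPU and memory usage
Note: the sharp recovery in resources after 00:00 occurred because the retention policy configured on some large StarRocks tables dropped a partition, which reduced the data volume and freed up resources.
- Query error: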
fe.log.20231114-1:2023-11-14 23:56:29,990 INFO (txnTimeoutChecker|73) [DatabaseTransactionMgr.abortTransaction():1263] transaction:[TransactionState. txn_id: 115950499, label: insert_acf904d3-8305-11ee-9dfb-0632b3434595, db id: 2215798, table id list: 18596555, callback id: -1, coordinator: FE: xxxx, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1699977089967, commit time: -1, finish time: 1699977389987, total cost: 300020ms, reason: timeout by txn manager] successfully rollback
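
To make the timeline in the FE line easier to read, here is a minimal Python sketch that pulls the epoch-millisecond fields out of the TransactionState entry and converts them to wall-clock time. The names `FE_LINE`, `parse_txn_times`, and `to_log_time` are hypothetical helpers written for this post, and the UTC+8 offset is an assumption inferred from the log timestamps, not something stated above. With that assumption, the prepare time lines up with the BE "open delta writer" entry at 23:51:29, the BE tablet commits follow at 23:51:39, and the FE aborts the transaction 300020 ms after prepare with commit time still -1. That would be consistent with a transaction timeout of about 300 seconds, though the configured value on this cluster is not shown in the logs.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the cluster writes logs in UTC+8 (inferred, not stated in the post).
LOG_TZ = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Abridged copy of the fe.log line above, keeping only the fields used here.
FE_LINE = (
    "txn_id: 115950499, transaction status: ABORTED, "
    "prepare time: 1699977089967, commit time: -1, "
    "finish time: 1699977389987, total cost: 300020ms, "
    "reason: timeout by txn manager"
)


def parse_txn_times(line):
    """Return the epoch-millisecond time fields of a TransactionState log line."""
    fields = {}
    for name in ("prepare time", "commit time", "finish time"):
        match = re.search(rf"{name}: (-?\d+)", line)
        if match is None:
            raise ValueError(f"field {name!r} not found in log line")
        fields[name] = int(match.group(1))
    return fields


def to_log_time(epoch_ms):
    """Convert epoch milliseconds to a datetime in the assumed log time zone."""
    return (EPOCH + timedelta(milliseconds=epoch_ms)).astimezone(LOG_TZ)


if __name__ == "__main__":
    times = parse_txn_times(FE_LINE)
    print("prepared :", to_log_time(times["prepare time"]))
    print("aborted  :", to_log_time(times["finish time"]))
    print("open for :", times["finish time"] - times["prepare time"], "ms")
    # A commit time of -1 means the FE never recorded a commit for this txn.
    print("committed:", times["commit time"] != -1)
```

Comparing these two timestamps with the BE entries for the same txn_id shows that the gap to explain is between the BE tablet commits and the FE abort, which is where further FE logs for this label would help.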