tablet ... version already has been merged. spec_version: [0-5577]

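【Details】Detailed problem description
Table: about 3.4 billion rows / 10 buckets / update model (UNIQUE KEY) / key made up of 4 varchar columns / replication_num = 3
Load frequency: 500,000 rows per batch
During loading, the message "transaction commit successfully, BUT data will be visible later" appeared once, and the write was retried (with a different label). Partial logs: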
2022-08-06 05:10:10,709 INFO (PUBLISH_VERSION|36) [DatabaseTransactionMgr.finishTransaction():885] finish transaction TransactionState. transaction id: 2734046, label: c10aae45-30d1-4b4e-a742-f09bad159349, db id: 10580, table id list: 38413, callback id: -1, coordinator: BE: 0.0.0.1, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1659733787340, commit time: 1659733802241, finish time: 1659733810708, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@23fa3b30 successfully

2022-08-06 05:10:19,810 INFO (PUBLISH_VERSION|36) [DatabaseTransactionMgr.finishTransaction():885] finish transaction TransactionState. transaction id: 2734053, label: 4d103b3a-974e-4a07-ac9a-aeb8a4587b3b, db id: 10580, table id list: 38413, callback id: -1, coordinator: BE: 0.0.0.34, transaction status: VISIBLE, error replicas num: 10, replica ids: 38755,38771,38759,38775,38747, prepare time: 1659733811126, commit time: 1659733818148, finish time: 1659733819810, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@5d530f9 successfully
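
For anyone scanning fe.log for similar entries, the two lines above can be compared programmatically. This is a minimal Python sketch, not a StarRocks tool: the regex is written against the field layout of the two lines quoted here (other versions may format the line differently), and the name `parse_finish_txn` is made up for illustration.

```python
import re

# Pattern matched against the finishTransaction() lines quoted in this post.
# Assumption: the field order is the same in the rest of the log.
TXN_RE = re.compile(
    r"transaction id: (?P<txn_id>\d+), label: (?P<label>[^,]+),"
    r".*?error replicas num: (?P<err_num>\d+), replica ids: (?P<replicas>[\d,]*),"
    r".*?commit time: (?P<commit>\d+), finish time: (?P<finish>\d+)"
)


def parse_finish_txn(line):
    """Return the interesting fields of one 'finish transaction' line, or None."""
    m = TXN_RE.search(line)
    if m is None:
        return None
    return {
        "txn_id": int(m.group("txn_id")),
        "label": m.group("label"),
        "error_replicas_num": int(m.group("err_num")),
        "replica_ids": [int(r) for r in m.group("replicas").split(",") if r],
        # Both timestamps are epoch milliseconds in the quoted lines.
        "commit_to_finish_ms": int(m.group("finish")) - int(m.group("commit")),
    }


# Second entry from the post, copied verbatim up to "finish time" (tail omitted).
SAMPLE = (
    "2022-08-06 05:10:19,810 INFO (PUBLISH_VERSION|36) "
    "[DatabaseTransactionMgr.finishTransaction():885] finish transaction TransactionState. "
    "transaction id: 2734053, label: 4d103b3a-974e-4a07-ac9a-aeb8a4587b3b, db id: 10580, "
    "table id list: 38413, callback id: -1, coordinator: BE: 0.0.0.34, "
    "transaction status: VISIBLE, error replicas num: 10, "
    "replica ids: 38755,38771,38759,38775,38747, prepare time: 1659733811126, "
    "commit time: 1659733818148, finish time: 1659733819810"
)

if __name__ == "__main__":
    print(parse_finish_txn(SAMPLE))
```

Run on the second entry, this reports an `error replicas num` of 10 alongside only five listed replica ids, which is the same mismatch visible in the raw line. Filtering a whole fe.log for entries where `error_replicas_num` is greater than 0 would show whether other loads were affected.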

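Problem:
A small amount of data has been lost; I suspect it happened either during the load or during version merging (compaction).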

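【Import/export method】 Stream load (500,000 rows per batch)
【Background】 What operations were performed?
【Business impact】 Rows are missing for no apparent reason. Bad input data can essentially be ruled out: the number of rows submitted to the load was checked against the source table and matches exactly.
【StarRocks version】 2.1.6
【Cluster size】 3 FE (1 follower + 2 observers) + 3 BE (FE and BE co-located)
【Machine info】 64 cores / 300 GB RAM / 10 GbE
【Attachments】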

1. Abnormal entries in be.warn:
W0806 04:55:29.968008 171325 version_graph.cpp:421] fail to find path in version_graph. spec_version: 0-5577
W0806 04:55:29.968026 171325 tablet.cpp:472] tablet:38782.990418800.a5452e2d221491b1-0f8b78a014b29298, version already has been merged. spec_version: [0-5577]
W0806 04:55:29.968039 171325 engine_checksum_task.cpp:90] Failed to prepare tablet reader. tablet=38782.990418800.a5452e2d221491b1-0f8b78a014b29298, error:Unknown code(45): :
/root/starrocks/be/src/storage/tablet.cpp:516 capture_consistent_versions(spec_version, &version_path)
W0806 04:55:29.968057 171325 task_worker_pool.cpp:1191] check consistency failed. status: Unknown code(45): :
/root/starrocks/be/src/storage/tablet.cpp:516 capture_consistent_versions(spec_version, &version_path)
/root/starrocks/be/src/storage/storage_engine.cpp:1008 task->execute(), signature: 38782
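
To see whether other tablets hit the same warning, be.warn can be scanned for it. The sketch below is only an illustration in Python: the regex follows the single line quoted above, it assumes the first dotted component after `tablet:` is the tablet id (it matches `signature: 38782` in the last line), and the function name is hypothetical.

```python
import re

# Matches the "version already has been merged" warning as quoted in this post.
MERGED_RE = re.compile(
    r"tablet:(?P<tablet_id>\d+)\.\S*, version already has been merged\. "
    r"spec_version: \[(?P<start>\d+)-(?P<end>\d+)\]"
)


def merged_version_warnings(lines):
    """Yield (tablet_id, start_version, end_version) for each matching warning."""
    for line in lines:
        m = MERGED_RE.search(line)
        if m:
            yield (
                int(m.group("tablet_id")),
                int(m.group("start")),
                int(m.group("end")),
            )


# The tablet.cpp:472 line from the post, copied verbatim.
SAMPLE_WARN = (
    "W0806 04:55:29.968026 171325 tablet.cpp:472] "
    "tablet:38782.990418800.a5452e2d221491b1-0f8b78a014b29298, "
    "version already has been merged. spec_version: [0-5577]"
)

if __name__ == "__main__":
    for hit in merged_version_warnings([SAMPLE_WARN]):
        print(hit)
```

In practice `lines` would be an open file handle on be.warn; collecting the distinct tablet ids shows whether the warning is limited to tablet 38782.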

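2. Abnormal SHOW TABLET output:
Two days have passed since the data was loaded.
SHOW TABLET returns 30 rows (10 tablets × 3 replicas). 28 of them have a VersionCount of 1, but two differ: one replica of tablet 38746 has a VersionCount of 3, and one replica of tablet 38766 has a VersionCount of 2.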
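
The replica comparison described here can be automated once the SHOW TABLET rows are fetched through any MySQL client. A minimal Python sketch, assuming each row is available as a dict with `TabletId` and `VersionCount` keys (column names as I understand the SHOW TABLET output; adjust if your version differs). The sample rows are illustrative and only mirror the counts reported above.

```python
def uneven_version_counts(rows):
    """Group rows by tablet and keep tablets whose replicas disagree on VersionCount."""
    by_tablet = {}
    for row in rows:
        by_tablet.setdefault(int(row["TabletId"]), []).append(int(row["VersionCount"]))
    return {t: sorted(c) for t, c in by_tablet.items() if len(set(c)) > 1}


# Illustrative rows only: three of the ten tablets, three replicas each,
# shaped like the situation described in this post.
rows = (
    [{"TabletId": 38746, "VersionCount": v} for v in (1, 1, 3)]
    + [{"TabletId": 38766, "VersionCount": v} for v in (1, 2, 1)]
    + [{"TabletId": 38782, "VersionCount": v} for v in (1, 1, 1)]
)

if __name__ == "__main__":
    print(uneven_version_counts(rows))
```

Tablets that come back from this function are the ones where at least one replica has not converged to the same version count as its peers.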

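3. Data symptoms: out of about 3.4 billion rows, 10,229 are missing.
The key columns are col1, col2, col3, col4. The missing rows break down as follows:
col1=0, col2=x: 10,000 rows missing
col1=0, col2=y: 229 rows missing
count(distinct col2) is a little over 10, and the values are not evenly distributed.
The key columns were checked; the missing rows are not caused by key overwrites.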
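
The per-key breakdown above can be reproduced by comparing grouped row counts from the source and from the StarRocks table. Below is a small Python sketch of that comparison; `missing_per_key` is a made-up helper, and the totals in the sample are placeholders. Only the two gaps (10,000 and 229) come from this post.

```python
from collections import Counter


def missing_per_key(source_counts, loaded_counts):
    """Return {key: rows missing} for keys where the table holds fewer rows than the source."""
    deficit = Counter(source_counts)
    deficit.subtract(loaded_counts)
    return {k: n for k, n in deficit.items() if n > 0}


# Placeholder totals keyed by (col1, col2); not real figures from the cluster.
source = {(0, "x"): 50_000, (0, "y"): 1_000, (0, "z"): 7_000}
loaded = {(0, "x"): 40_000, (0, "y"): 771, (0, "z"): 7_000}

if __name__ == "__main__":
    gaps = missing_per_key(source, loaded)
    print(gaps, sum(gaps.values()))
```

The two input dicts would typically come from a `GROUP BY col1, col2` count run on each side; summing the gaps should match the overall difference in row counts.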

Does anyone know what causes the be.warn exception ("version already has been merged")?


Take a look at this article for reference.