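Shared-data cluster 3.4.0: a primary key table with the cloud-native index performs below expectations; the load stays in the prepared state for a long time and never finishes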

CREATE TABLE statement:

CREATE TABLE `dim_user` (
  `local_id` bigint(20) NOT NULL COMMENT "account ID",
  `global_id` bigint(20) NULL COMMENT "globalID",
  `is_a1` tinyint(4) NULL COMMENT "business domain",
  `is_a2` tinyint(4) NULL COMMENT "business domain",
  `is_a3` tinyint(4) NULL COMMENT "business domain"
) ENGINE=OLAP 
PRIMARY KEY(`local_id`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`local_id`)
PROPERTIES (
"compression" = "LZ4",
"datacache.enable" = "true",
"enable_async_write_back" = "false",
"enable_persistent_index" = "true",
"persistent_index_type" = "CLOUD_NATIVE",
"replication_num" = "1",
"storage_volume" = "builtin_storage_volume"
);

Broker Load status: still not finished after more than 1 hour. The load is 106 GB of data, 5.4 billion rows, about 10 GB after compression.

Leader FE log. The message "The previous publish version task for tablet 10608 has not finished." keeps appearing:

     2025-03-15 17:24:37.723+08:00 ERROR (lake-publish-task-1128|17122) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:38.732+08:00 ERROR (lake-publish-task-1129|17145) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:39.742+08:00 ERROR (lake-publish-task-1130|17173) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:40.751+08:00 ERROR (lake-publish-task-1131|17210) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:41.759+08:00 ERROR (lake-publish-task-1132|17233) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:42.769+08:00 ERROR (lake-publish-task-1133|17254) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:43.778+08:00 ERROR (lake-publish-task-1134|17259) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:44.786+08:00 ERROR (lake-publish-task-1135|17283) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:45.788+08:00 ERROR (lake-publish-task-1136|17319) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:46.799+08:00 ERROR (lake-publish-task-1137|17345) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:47.809+08:00 ERROR (lake-publish-task-1138|17367) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
     2025-03-15 17:24:48.819+08:00 ERROR (lake-publish-task-1139|17395) [PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: Fail to publish version for tablets [10608]: The previous publish version task for tablet 10608 has not finished. You can ignore this error and the task will retry later., host: starrocks-share-data-xren-cn-7.starrocks-share-data-xren-cn-search.starrocks.svc.cluster.local
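To see how often each tablet is being retried, the leader log lines above can be tallied with a small script. This is only a sketch: the regular expression is inferred from the sample lines quoted in this post, and `count_publish_retries` is a hypothetical helper name, not part of StarRocks.

```python
import re
from collections import Counter

# Pattern inferred from the leader log lines quoted in this thread (an assumption,
# not an official log format specification).
PUBLISH_FAIL = re.compile(
    r"Fail to publish partition (?P<partition>\d+) of txn (?P<txn>\d+): "
    r"Fail to publish version for tablets \[(?P<tablets>[\d, ]+)\]"
)

def count_publish_retries(lines):
    """Count failed publish attempts per (txn_id, tablet_id) pair."""
    counts = Counter()
    for line in lines:
        m = PUBLISH_FAIL.search(line)
        if m is None:
            continue  # not a publish-failure line
        txn = int(m.group("txn"))
        for tablet in m.group("tablets").split(","):
            counts[(txn, int(tablet.strip()))] += 1
    return counts

# Two of the lines from the post, shortened after the tablet list.
sample = [
    "2025-03-15 17:24:37.723+08:00 ERROR (lake-publish-task-1128|17122) "
    "[PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: "
    "Fail to publish version for tablets [10608]: The previous publish version task ...",
    "2025-03-15 17:24:38.732+08:00 ERROR (lake-publish-task-1129|17145) "
    "[PublishVersionDaemon.publishPartition():827] Fail to publish partition 10596 of txn 7: "
    "Fail to publish version for tablets [10608]: The previous publish version task ...",
]
retries = count_publish_retries(sample)
for (txn, tablet), n in retries.items():
    print(f"txn {txn}, tablet {tablet}: {n} failed publish attempts")
```

In the excerpt above every line points at the same txn and the same tablet, which suggests a single slow publish task rather than many independent failures.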

CN node logs.
Warning log: a large number of thrift communication errors, "no more data to read". It also contains: write star cache file data/35f8fd02-eb23-422f-a6f9-dec0e5739035.sst failed, errmsg: INTERNAL: fail to write block because no valid cache spac

Tablet-related logs:

Logs related to the txn_id:

Object storage download traffic:

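Is this a bug in the engine, or does this workload simply need a lot of download bandwidth from object storage?

As things stand, we don't dare use primary key tables in shared-data mode; we are using duplicate key tables everywhere instead.

Requesting help from the experts.

@lvlouisaslia could you take a look at this when you have time? :pray: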

Is the cache not enabled on the CN?

It is enabled by default and we have not turned it off. The local disk should be about 1 TB.

starlet_star_cache_disk_size_bytes = 236870912000
starlet_star_cache_disk_size_percent = 15
flush_thread_num_per_store = 32
number_tablet_writer_threads = 32
lake_metadata_cache_limit = 8589934592
starlet_fs_stream_buffer_size_bytes = 10485760
compact_threads = 24
max_cumulative_compaction_num_singleton_deltas = 100
thrift_rpc_timeout_ms = 30000
create_tablet_worker_count = 20
object_storage_request_timeout_ms = 300
mem_limit = 60G
datacache_disk_low_level = 10
datacache_disk_safe_level = 20
datacache_disk_high_level = 40
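For a sense of scale, the two cache size settings above can be converted to GiB. This is plain arithmetic only; it assumes "about 1 TB" means 10^12 bytes, and it does not say which of the two settings the CN actually applies, since the thread does not establish that.

```python
GIB = 1024 ** 3

configured_cache_bytes = 236870912000  # starlet_star_cache_disk_size_bytes from the post
configured_cache_percent = 15          # starlet_star_cache_disk_size_percent from the post
assumed_disk_bytes = 10 ** 12          # assumption: "about 1 TB" local disk read as 10^12 bytes

# Size implied by the absolute byte setting.
size_from_bytes_gib = configured_cache_bytes / GIB
# Size implied by the percentage setting on the assumed disk size.
size_from_percent_gib = assumed_disk_bytes * configured_cache_percent / 100 / GIB

print(f"from bytes setting:   {size_from_bytes_gib:.1f} GiB")
print(f"from percent setting: {size_from_percent_gib:.1f} GiB")
```

Either figure is well above the roughly 10 GB of compressed data in this load, so the configured sizes alone would not explain a "no valid cache space" warning.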

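Check the monitoring: what are the cache hit rate and the cache usage?

I ran the load again, and now it is all errors like this.

And the cache is basically unused.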

@lvlouisaslia

You are not using the cache at all; no wonder the object storage bandwidth is being pushed so high.

Is storage_root_path not configured?

@lvlouisaslia
storage_root_path uses the default configuration. It should be in effect, right?

After restarting, check the data cache init log.

I20250318 21:24:23.747688 140266263087808 starrocks_be.cpp:209] CN start step 7: exec engine init successfully
I20250318 21:24:23.750524 140266263087808 block_cache.cpp:50] init starcache engine, block_size: 262144
I20250318 21:24:23.750961 140266263087808 mem_cache.cpp:59] init mem cache with segment lru policy: [35,65], segment_freq_bits: 0, slru_name: default_cache_mem_slru
I20250318 21:24:23.751044 140266263087808 disk_cache.cpp:60] init disk cache with segment lru policy: [35,65], segment_freq_bits: 0, slru_name: default_cache_disk_0_slru
I20250318 21:24:23.751969 140266263087808 star_cache_impl.cpp:141] init starcache success. block_size: 262144, disk checksum: 0, mem_quota: 0, disk_quota: 0, scheduler threads: 10
I20250318 21:24:23.752108 140266263087808 starrocks_be.cpp:221] CN start step 9: datacache init successfully
I20250318 21:24:23.756446 140266263087808 starrocks_be.cpp:233] CN start step 10: staros worker init successfully

Is disk_quota being 0 a problem? Why would it be 0?
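To check the same thing across several CN nodes, the quotas can be pulled out of the "init starcache success" line with a short script. A minimal sketch, assuming the line keeps the comma-separated `key: value` layout seen in the log above; `parse_starcache_init` is a hypothetical helper, not a StarRocks tool.

```python
import re

# Layout inferred from the single init line quoted in this thread (an assumption).
INIT_LINE = re.compile(r"init starcache success\.\s*(?P<fields>.*)$")

def parse_starcache_init(line):
    """Return the integer key/value fields of a starcache init line, or None."""
    m = INIT_LINE.search(line)
    if m is None:
        return None
    fields = {}
    for part in m.group("fields").split(","):
        key, _, value = part.partition(":")
        value = value.strip()
        if value.isdigit():  # skip anything that is not a plain non-negative integer
            fields[key.strip()] = int(value)
    return fields

log_line = (
    "I20250318 21:24:23.751969 140266263087808 star_cache_impl.cpp:141] "
    "init starcache success. block_size: 262144, disk checksum: 0, "
    "mem_quota: 0, disk_quota: 0, scheduler threads: 10"
)
parsed = parse_starcache_init(log_line)
# Report which quotas were initialized to zero on this node.
zero_quotas = [k for k in ("mem_quota", "disk_quota") if parsed.get(k) == 0]
print("zero quotas:", zero_quotas)
```

For the line above this reports both mem_quota and disk_quota as zero, which matches the observation that the cache is basically unused.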