SR cross-cluster data migration: sync program exits before the migration finishes

starrocker123 · October 8, 2024, 10:19

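【Details】During a cross-cluster data migration, the sync program exits after migrating 32T, with 5T of data still left to migrate.
【Background】Things I have tried: 1. Changed the sync parameters several times and restarted the migration (no effect). 2. Cleared all already-synced data on the target cluster and restarted the sync from scratch (no effect).
【Business impact】The data on the two clusters is inconsistent as a direct result.
【Shared-data (storage-compute separation)】No
【StarRocks version】Source cluster: v3.1.2. Target cluster: v3.2.8.
【Cluster size】Source cluster: 3 FE 16g/8c/100g, 7 BE 64g/32C/10T
Target cluster: 3 FE 64g/32c/200g, 7 BE 64g/32C/10T
【Machine info】64g/32C/10 Gigabit network
【Table model】Duplicate Key table
【Import or export method】starrocks-cluster-sync tool
【Contact】hsdcloud#163.com
【Attachments】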

[image attachment]
After enabling DEBUG-level logging, the log lines below appear. I am not sure what exactly they mean. (Is the replication being skipped?)
24/10/08 18:12:44 DEBUG [replication-job-handler] create(GenericPool.java:106): before create socket hostname=7.44.76.10 key.port=9020 timeoutMs=600000
24/10/08 18:12:44 DEBUG [replication-job-handler] sendReplicationJob(Utils.java:255): Ignore the replication job 10153-3002667_1728382361526-1 to TNetworkAddress(hostname:7.44.76.10, port:9020), table: ods_xf.ods_xf_game_log, detail message: Replication job of table 3002667 is already running
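
To see how often replication jobs are being ignored and for which tables, the debug log can be tallied with a small script. This is a minimal sketch: the regular expression is modeled only on the single "Ignore the replication job" line quoted above, so other message formats the tool may emit are not covered, and `count_ignored_jobs` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern modeled on the one "Ignore the replication job" line quoted above.
# Assumption: other lines of this kind follow the same layout.
IGNORED = re.compile(
    r"Ignore the replication job (?P<job>\S+) to .*?"
    r"table: (?P<table>[\w.]+), detail message: (?P<msg>.*)$"
)


def count_ignored_jobs(lines):
    """Count ignored replication jobs per (table, detail message)."""
    counts = Counter()
    for line in lines:
        match = IGNORED.search(line)
        if match:
            key = (match.group("table"), match.group("msg").strip())
            counts[key] += 1
    return counts
```

It can be fed an open log file, for example `count_ignored_jobs(open("sync.log"))`, where `sync.log` is a placeholder for wherever the tool writes its log.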

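Doni · October 8, 2024, 12:14

Please post your conf/sync.properties configuration file.

Doni · October 8, 2024, 12:16

This log shows one expired table. A table that has already been fully synchronized, but that later received new writes on the source cluster, is treated as a table with out-of-date data, that is, an expired table. Check whether the tables holding the remaining 5T of data fall into this case.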
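
One way to act on this suggestion is to compare per-partition row counts between the two clusters and list the partitions that differ. The sketch below only does the comparison. How the counts are collected (for example a per-partition `COUNT(*)` run against each cluster) is left to the reader, and `find_lagging_partitions` and the partition names in the usage note are hypothetical.

```python
def find_lagging_partitions(source_rows, target_rows):
    """Return partitions whose target row count differs from the source.

    Both arguments map partition name -> row count. The result maps each
    differing partition to a (source_count, target_count) tuple; the
    target count is None when the partition is missing on the target.
    """
    lagging = {}
    for name, source_count in source_rows.items():
        target_count = target_rows.get(name)
        if target_count != source_count:
            lagging[name] = (source_count, target_count)
    return lagging
```

For example, `find_lagging_partitions({"p1": 10, "p2": 20}, {"p1": 10, "p2": 15})` reports only `p2`. If the source is still being written to, counts taken at different moments will differ for that reason alone.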

starrocker123 · October 9, 2024, 02:16

# If true, all tables will be synchronized only once, and the program will exit automatically after completion.
one_time_run_mode=true

source_fe_host=
source_fe_query_port=9030
source_cluster_user=
source_cluster_password=
source_cluster_password_secret_key=
source_cluster_token=

target_fe_host=
target_fe_query_port=9030
target_cluster_user=
target_cluster_password=
target_cluster_password_secret_key=

# Comma-separated list of database names or table names like <db_name> or <db_name.table_name>
# example: db1,db2.tbl2,db3
# Effective order: 1. include 2. exclude
include_data_list=ods_xf.ods_xf_game_log
exclude_data_list=

# If there are no special requirements, please maintain the default values for the following configurations.
target_cluster_storage_volume=
target_cluster_replication_num=-1
target_cluster_max_disk_used_percent=80

max_replication_data_size_per_job_in_gb=100

meta_job_interval_seconds=180
meta_job_threads=4
ddl_job_interval_seconds=10
ddl_job_batch_size=10
ddl_job_allow_drop_target_only=false
ddl_job_allow_drop_schema_change_table=true
ddl_job_allow_drop_inconsistent_partition=true
ddl_job_allow_drop_partition_target_only=true
replication_job_interval_seconds=10
replication_job_batch_size=20
report_interval_seconds=300
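
When comparing runs with different parameters, it can help to print the handful of settings that govern run mode and job sizing. The sketch below parses the `key=value` format shown above and picks out a few keys taken from this file. The helper names are hypothetical. Note that, per the comment in the file itself, `one_time_run_mode=true` makes the program exit once all tables have been synchronized once; whether that is related to the exit at 32T is not established in this thread.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Keys copied from the sync.properties posted above.
KEYS_OF_INTEREST = [
    "one_time_run_mode",
    "include_data_list",
    "max_replication_data_size_per_job_in_gb",
    "replication_job_batch_size",
    "replication_job_interval_seconds",
]


def summarize(props):
    """Return only the settings of interest (None if a key is absent)."""
    return {key: props.get(key) for key in KEYS_OF_INTEREST}
```

Usage: `summarize(load_properties(open("conf/sync.properties").read()))`.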

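starrocker123 · October 9, 2024, 02:27

Only one table, ods_xf.ods_xf_game_log, is being synchronized. The table holds 37T of data in total. I restarted the sync from scratch twice, and both times it stalled at 32T, after which the amount of synced data stopped growing.

After enabling debug logging, I see thrift socket connection errors, and they are reported frequently.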
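
Since the debug log earlier in the thread shows the tool opening a socket to 7.44.76.10:9020, a plain TCP reachability check from the machine running the sync tool can rule out basic network problems. This is a minimal sketch with a hypothetical helper name. It only tests whether the port accepts a connection, not whether a thrift RPC over that connection succeeds.

```python
import socket


def can_connect(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Usage: `can_connect("7.44.76.10", 9020)`, with the host and port taken from the debug line above.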