StarRocks external table sync anomaly: target cluster BE nodes get added to the source cluster's BE list

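【Details】
1. Background: To separate reads from writes in StarRocks, two StarRocks clusters were deployed. The BEs of both SR clusters run on the same servers and are distinguished by port. An external table pointing to the target cluster was created on the source SR cluster.
2. Problem: When writing to the external table, the target cluster's BE nodes show up in the source cluster's backend list. Restarting the source SR cluster's FE restores the normal state, but the problem recurs as soon as anything is written to the external table.
【Business impact】None; this is a development environment still under development.
【Shared-data (storage-compute separation)】No
【StarRocks version】3.3.0
【Cluster size】Source cluster: 3 FE + 2 BE; target cluster: 1 FE + 2 BE
【Attachments】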
mysql> show backends;
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
| BackendId | IP | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime | LastHeartbeat | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version | Status | DataTotalCapacity | DataUsedPct | CpuCores | NumRunningQueries | MemUsedPct | CpuUsedPct | DataCacheMetrics | Location |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
| 1149057 | 10.0.0.130 | 9050 | 9060 | 8040 | 8060 | 2024-07-03 13:45:05 | 2024-07-19 11:17:38 | true | false | false | 21478 | 14.878 GB | 353.136 GB | 464.336 GB | 23.95 % | 23.95 % | | 3.3.0-19a3f66 | {"lastSuccessReportTabletsTime":"2024-07-19 11:17:18"} | 368.014 GB | 4.04 % | 4 | 1 | 25.19 % | 25.4 % | Status: Normal, DiskUsage: 0B/230GB, MemUsage: 0B/0B | |
| 1149055 | 10.0.0.32 | 9050 | 9060 | 8040 | 8060 | 2024-07-03 13:54:55 | 2024-07-19 11:17:38 | true | false | false | 21478 | 15.173 GB | 596.938 GB | 849.976 GB | 29.77 % | 29.77 % | | 3.3.0-19a3f66 | {"lastSuccessReportTabletsTime":"2024-07-19 11:17:04"} | 612.111 GB | 2.48 % | 12 | 1 | 24.17 % | 12.4 % | Status: Normal, DiskUsage: 0B/360GB, MemUsage: 0B/0B | |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
2 rows in set (0.01 sec)

mysql> show backends;
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
| BackendId | IP | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime | LastHeartbeat | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version | Status | DataTotalCapacity | DataUsedPct | CpuCores | NumRunningQueries | MemUsedPct | CpuUsedPct | DataCacheMetrics | Location |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
| 10003 | 10.0.0.130 | 0 | 29060 | 28040 | 28060 | NULL | NULL | true | false | false | 0 | 0.000 B | 1.000 B | 0.000 B | 0.00 % | 0.00 % | | | {"lastSuccessReportTabletsTime":"N/A"} | 0.000 B | 0.00 % | 0 | 0 | 0.00 % | 0.0 % | N/A | |
| 1149057 | 10.0.0.130 | 9050 | 9060 | 8040 | 8060 | 2024-07-03 13:45:05 | 2024-07-19 11:17:38 | true | false | false | 21478 | 14.878 GB | 353.136 GB | 464.336 GB | 23.95 % | 23.95 % | | 3.3.0-19a3f66 | {"lastSuccessReportTabletsTime":"2024-07-19 11:17:18"} | 368.014 GB | 4.04 % | 4 | 1 | 25.20 % | 54.9 % | Status: Normal, DiskUsage: 0B/230GB, MemUsage: 0B/0B | |
| 10004 | 10.0.0.32 | 0 | 29060 | 28040 | 28060 | NULL | NULL | true | false | false | 0 | 0.000 B | 1.000 B | 0.000 B | 0.00 % | 0.00 % | | | {"lastSuccessReportTabletsTime":"N/A"} | 0.000 B | 0.00 % | 0 | 0 | 0.00 % | 0.0 % | N/A | |
| 1149055 | 10.0.0.32 | 9050 | 9060 | 8040 | 8060 | 2024-07-03 13:54:55 | 2024-07-19 11:17:38 | true | false | false | 21478 | 15.173 GB | 596.938 GB | 849.976 GB | 29.77 % | 29.77 % | | 3.3.0-19a3f66 | {"lastSuccessReportTabletsTime":"2024-07-19 11:17:04"} | 612.111 GB | 2.48 % | 12 | 0 | 24.17 % | 14.6 % | Status: Normal, DiskUsage: 0B/360GB, MemUsage: 0B/0B | |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+------------------------------------------------------+----------+
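
The two unexpected rows in the second output (10003 and 10004) share a recognizable signature: HeartbeatPort is 0 and LastHeartbeat is NULL. A minimal Python sketch for flagging such rows, assuming the `SHOW BACKENDS` result has already been fetched as dictionaries keyed by column name. The function name `find_phantom_backends` and the row layout are illustrative and not part of StarRocks; the heuristic is drawn only from this report.

```python
from typing import Dict, List, Optional


def find_phantom_backends(rows: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the BackendId of every row matching the signature seen in this report.

    Heuristic (an assumption based on the output above, not a StarRocks rule):
    HeartbeatPort is 0 and the backend has never reported a heartbeat.
    """
    phantom = []
    for row in rows:
        no_heartbeat_port = int(row["HeartbeatPort"]) == 0
        never_seen = row["LastHeartbeat"] in (None, "NULL", "")
        if no_heartbeat_port and never_seen:
            phantom.append(row["BackendId"])
    return phantom


# Values copied from the second SHOW BACKENDS output (relevant columns only).
rows = [
    {"BackendId": "10003", "IP": "10.0.0.130", "HeartbeatPort": "0",
     "BePort": "29060", "LastHeartbeat": "NULL"},
    {"BackendId": "1149057", "IP": "10.0.0.130", "HeartbeatPort": "9050",
     "BePort": "9060", "LastHeartbeat": "2024-07-19 11:17:38"},
    {"BackendId": "10004", "IP": "10.0.0.32", "HeartbeatPort": "0",
     "BePort": "29060", "LastHeartbeat": "NULL"},
    {"BackendId": "1149055", "IP": "10.0.0.32", "HeartbeatPort": "9050",
     "BePort": "9060", "LastHeartbeat": "2024-07-19 11:17:38"},
]

print(find_phantom_backends(rows))  # ['10003', '10004']
```

The flagged rows use ports 29060/28040/28060 rather than the source cluster's 9060/8040/8060, which is consistent with them being the target cluster's BEs on the same hosts.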

FE log:
2024-07-19 11:17:41.498+08:00 INFO (tablet scheduler|36) [TabletScheduler.updateWorkingSlots():265] add new backend 10003 with slots num: 0
2024-07-19 11:17:41.498+08:00 INFO (tablet scheduler|36) [TabletScheduler.updateWorkingSlots():265] add new backend 10004 with slots num: 0
2024-07-19 11:17:41.656+08:00 INFO (thrift-server-pool-17|262) [FrontendServiceImpl.forward():1188] receive forwarded stmt 171 from FE: 10.0.0.67
2024-07-19 11:17:41.659+08:00 INFO (thrift-server-pool-17|262) [BackendsProcDir.getClusterBackendInfos():252] backends proc get tablet num cost: 0, total cost: 2
2024-07-19 11:17:57.511+08:00 INFO (tablet scheduler|36) [ClusterLoadStatistic.classifyBackendByLoad():163] classify backend by load. medium: HDD, avg load score: 1.0317572273049016, low/mid/high: 1/0/1
2024-07-19 11:17:57.512+08:00 INFO (tablet scheduler|36) [TabletScheduler.updateClusterLoadStatistic():493] update cluster load statistic:
{"beId":10003,"clusterName":"default_cluster","isAvailable":true,"cpuCores":0,"memLimit":0,"memUsed":0,"mediums":[{"medium":"HDD","replica":0,"used":0,"total":"0B","score":0.0},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[]}
{"beId":10004,"clusterName":"default_cluster","isAvailable":true,"cpuCores":0,"memLimit":0,"memUsed":0,"mediums":[{"medium":"HDD","replica":0,"used":0,"total":"0B","score":0.0},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[]}
{"beId":1149055,"clusterName":"default_cluster","isAvailable":true,"cpuCores":12,"memLimit":14495514624,"memUsed":3518386832,"mediums":[{"medium":"HDD","replica":21474,"used":16292203352,"total":"612.1GB","score":0.9042423455160667},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":1149055,"path":"/data/starrocks/srdata/storage","pathHash":944752296301124751,"storageMedium":"HDD","total":657249237848,"used":16292203352}]}
{"beId":1149057,"clusterName":"default_cluster","isAvailable":true,"cpuCores":4,"memLimit":11596411699,"memUsed":2936659184,"mediums":[{"medium":"HDD","replica":21474,"used":15974838959,"total":"368GB","score":1.1592721090937363},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":1149057,"path":"/data/starrocks/srdata/storage","pathHash":-7326092759831245816,"storageMedium":"HDD","total":395151704751,"used":15974838959}]}

2024-07-19 11:17:57.715+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:57.792+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:57.868+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:57.951+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.027+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.102+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.169+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.220+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.274+08:00 INFO (AutoStatistic|28) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: []
2024-07-19 11:17:58.274+08:00 WARN (heartbeat-mgr-pool-6|135) [HeartbeatMgr$BackendHeartbeatHandler.call():332] backend heartbeat got exception, addr: 10.0.0.32:0
org.apache.thrift.transport.TTransportException: Invalid port 0
at org.apache.thrift.transport.TSocket.open(TSocket.java:218) ~[libthrift-0.20.0.jar:0.20.0]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:148) ~[starrocks-fe.jar:?]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:133) ~[starrocks-fe.jar:?]
at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:276) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:262) ~[starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
2024-07-19 11:17:58.274+08:00 WARN (heartbeat-mgr-pool-4|133) [HeartbeatMgr$BackendHeartbeatHandler.call():332] backend heartbeat got exception, addr: 10.0.0.130:0
org.apache.thrift.transport.TTransportException: Invalid port 0
at org.apache.thrift.transport.TSocket.open(TSocket.java:218) ~[libthrift-0.20.0.jar:0.20.0]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:148) ~[starrocks-fe.jar:?]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:133) ~[starrocks-fe.jar:?]
at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:276) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:262) ~[starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
2024-07-19 11:17:58.275+08:00 WARN (heartbeat mgr|16) [HeartbeatMgr.runAfterCatalogReady():168] get bad heartbeat response: type: BACKEND, status: BAD, msg: Invalid port 0
2024-07-19 11:17:58.276+08:00 INFO (heartbeat mgr|16) [ComputeNode.handleHbResponse():567] Backend [id=10003, host=10.0.0.130, heartbeatPort=0, alive=false] is dead due to exceed heartbeatRetryTimes
2024-07-19 11:17:58.288+08:00 INFO (heartbeat mgr|16) [CoordinatorMonitor.addDeadBackend():57] add backend 10003 to dead backend queue
2024-07-19 11:17:58.291+08:00 WARN (heartbeat mgr|16) [HeartbeatMgr.runAfterCatalogReady():168] get bad heartbeat response: type: BACKEND, status: BAD, msg: Invalid port 0
2024-07-19 11:17:58.291+08:00 INFO (heartbeat mgr|16) [ComputeNode.handleHbResponse():567] Backend [id=10004, host=10.0.0.32, heartbeatPort=0, alive=false] is dead due to exceed heartbeatRetryTimes
2024-07-19 11:17:58.291+08:00 INFO (heartbeat mgr|16) [CoordinatorMonitor.addDeadBackend():57] add backend 10004 to dead backend queue
2024-07-19 11:17:58.291+08:00 WARN (Thread-104|186) [CoordinatorMonitor$DeadBackendAndComputeNodeChecker.run():91] Cancel query [7f6d1bd6-457d-11ef-81e4-525400e3301f], because some related backend is not alive
2024-07-19 11:17:58.293+08:00 WARN (Thread-104|186) [DefaultCoordinator.cancel():855] cancel execState of query, this is outside invoke
2024-07-19 11:17:58.294+08:00 INFO (Thread-104|186) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: [7f6d1bd6-457d-11ef-81e4-525400e33022, 7f6d1bd6-457d-11ef-81e4-525400e33020, 7f6d1bd6-457d-11ef-81e4-525400e33021]
2024-07-19 11:17:58.294+08:00 INFO (Thread-104|186) [QueryRuntimeProfile.finishAllInstances():239] unfinished instances: [7f6d1bd6-457d-11ef-81e4-525400e33022, 7f6d1bd6-457d-11ef-81e4-525400e33020, 7f6d1bd6-457d-11ef-81e4-525400e33021]
2024-07-19 11:17:58.294+08:00 INFO (Thread-104|186) [DefaultCoordinator.cancel():863] count down profileDoneSignal since backend has crashed, query id: 7f6d1bd6-457d-11ef-81e4-525400e3301f
2024-07-19 11:17:58.294+08:00 WARN (thrift-server-pool-19|266) [DefaultCoordinator.updateFragmentExecStatus():946] exec state report failed status=errorCode CANCELLED InternalError, query_id=7f6d1bd6-457d-11ef-81e4-525400e3301f, instance_id=7f6d1bd6-457d-11ef-81e4-525400e33020, backend_id=10004
2024-07-19 11:17:58.294+08:00 WARN (thrift-server-pool-2|202) [DefaultCoordinator.updateFragmentExecStatus():946] exec state report failed status=errorCode CANCELLED InternalError, query_id=7f6d1bd6-457d-11ef-81e4-525400e3301f, instance_id=7f6d1bd6-457d-11ef-81e4-525400e33022, backend_id=1149055
2024-07-19 11:17:58.294+08:00 WARN (thrift-server-pool-7|207) [DefaultCoordinator.updateFragmentExecStatus():946] exec state report failed status=errorCode CANCELLED InternalError, query_id=7f6d1bd6-457d-11ef-81e4-525400e3301f, instance_id=7f6d1bd6-457d-11ef-81e4-525400e33021, backend_id=1149057
2024-07-19 11:17:58.295+08:00 WARN (AutoStatistic|28) [DefaultCoordinator.getNext():785] get next fail, need cancel. status errorCode CANCELLED InternalError, query id: 7f6d1bd6-457d-11ef-81e4-525400e3301f
2024-07-19 11:17:58.295+08:00 WARN (AutoStatistic|28) [DefaultCoordinator.getNext():811] query failed: Backend node not found. Check if any backend node is down.
2024-07-19 11:17:58.295+08:00 WARN (AutoStatistic|28) [StmtExecutor.executeStmtWithExecPlan():2405] Failed to execute executeStmtWithExecPlan
com.starrocks.common.UserException: Backend node not found. Check if any backend node is down.