FQDN-based cluster: after a pod is powered off and back on, FE-to-BE communication fails, then recovers automatically after a while

To help us locate your issue faster, please provide the following information, thanks.
【Details】We powered off one BE pod and one FE pod. While they were down, internal-table queries worked fine; after powering them back on, queries timed out. Our current suspicion is that the pod IP changed across the restart, even though the FQDN deployment mode has long since stopped depending on IPs and the deployment itself never referenced any IP. From the logs below, after power-on the channel fetched from brpc still carried the BE pod's pre-restart IP; after some time (not fixed, sometimes 20-odd minutes) queries recovered on their own, and from then on the channel fetched from brpc showed the new post-restart BE pod IP. It feels like the community may not have accounted for this kind of sudden pod power loss.
【Background】We powered off one BE pod and one FE pod; internal-table queries worked while they were down, but timed out after power-on.
【Business impact】
【Shared-data or shared-nothing】Shared-nothing
【StarRocks version】3.2.13
【Cluster size】3 FE + 3 BE, deployed via FQDN

【Attachments】
[2025-07-03 16:27:27.786 +0800] [] [] [WARN] [starrocks-mysql-nio-pool-94] [FragmentInstanceExecState.java] [com.starrocks.qe.scheduler.dag.FragmentInstanceExecState] [waitForDeploymentCompletion] [274] catch a execute exception java.util.concurrent.ExecutionException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:275 timeout with bound channel =>[id: 0xbaa63cc9, L:/20.20.156.146:33578 - R:odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local/20.20.65.156:28243]
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:578)
at com.starrocks.qe.scheduler.dag.FragmentInstanceExecState.waitForDeploymentCompletion(FragmentInstanceExecState.java:268)
at com.starrocks.qe.scheduler.Deployer.waitForDeploymentCompletion(Deployer.java:225)
at com.starrocks.qe.scheduler.Deployer.deployFragments(Deployer.java:116)
at com.starrocks.qe.DefaultCoordinator.deliverExecFragments(DefaultCoordinator.java:589)
at com.starrocks.qe.DefaultCoordinator.startScheduling(DefaultCoordinator.java:502)
at com.starrocks.qe.scheduler.Coordinator.startScheduling(Coordinator.java:102)
at com.starrocks.qe.scheduler.Coordinator.exec(Coordinator.java:85)
at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1132)
at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:634)
at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:346)
at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:542)
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:850)
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:70)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:275 timeout with bound channel =>[id: 0xbaa63cc9, L:/20.20.156.146:33578 - R:odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local/20.20.65.156:28243]
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.doWaitCallback(ProtobufRpcProxy.java:651)
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.access$0(ProtobufRpcProxy.java:611)
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:576)
... 16 more

[2025-07-03 16:27:27.788 +0800] [] [] [WARN] [starrocks-mysql-nio-pool-94] [ExecuteExceptionHandler.java] [com.starrocks.qe.ExecuteExceptionHandler] [handleRpcException] [96] Query cancelled by crash of backends or RpcException, [QueryId=6a15f144-57e7-11f0-8099-ea28c28adb39] [SQL=SELECT pr_dt_wytest1953557994_logical.dn AS dn, pr_dt_wytest1953557994_logical.timestamp AS timestamp FROM (SELECT pr_dt_wytest1953557994.timestamp AS timestamp, pr_dt_wytest1953557994.dn AS dn, pr_dt_wytest1953557994.createTime AS createTime, pr_dt_wytest1953557994.arrivalTime AS arrivalTime, pr_dt_wytest1953557994.saveTime AS saveTime, pr_dt_wytest1953557994.le AS le, pr_dt_wytest1953557994.go_gc_heap_allocs_by_size_bytes_bucket AS go_gc_heap_allocs_by_size_bytes_bucketFROM _default.dte__DEFAULT_pr_dt_wytest1953557994 AS pr_dt_wytest1953557994) AS pr_dt_wytest1953557994_logical WHERE pr_dt_wytest1953557994_logical.dn = ‘NE=U7v767kBRipMuCu7lBfPA’ AND pr_dt_wytest1953557994_logical.timestamp >= 1751161239103 AND pr_dt_wytest1953557994_logical.timestamp <= 1751334039103 ORDER BY timestamp DESC LIMIT 1] [Plan=PLAN COST\n CPU: 1412.5039265664927\n Memory: 64.0\n\nPLAN FRAGMENT 0(F01)\n Output Exprs:2: dn | 1: timestamp\n Input Partition: UNPARTITIONED\n RESULT SINK\n\n 2:MERGING-EXCHANGE\n distribution type: GATHER\n partition type: UNPARTITIONED\n limit: 1\n cardinality: 1\n column statistics: \n * timestamp–>[1.751333943E12, 1.751334039103E12, 0.0, 8.0, 21.57037385260145] ESTIMATE\n * dn–>[-Infinity, Infinity, 0.0, 24.0, 1.0] ESTIMATE\n\nPLAN FRAGMENT 1(F00)\n\n Input Partition: RANDOM\n OutPut Partition: UNPARTITIONED\n OutPut Exchange Id: 02\n\n 1:TOP-N\n | order by: [1, BIGINT, true] DESC\n | build runtime filters:\n | - filter_id = 0, build_expr = (<slot 1> 1: timestamp), remote = false\n | offset: 0\n | limit: 1\n | cardinality: 1\n | column statistics: \n | * timestamp–>[1.751333943E12, 1.751334039103E12, 0.0, 8.0, 21.57037385260145] ESTIMATE\n | * dn–>[-Infinity, Infinity, 0.0, 24.0, 
1.0] ESTIMATE\n | \n 0:OlapScanNode\n table: dte__DEFAULT_pr_dt_wytest1953557994, rollup: dte__DEFAULT_pr_dt_wytest1953557994\n preAggregation: on\n Predicates: [2: dn, VARCHAR, true] = ‘NE=U7v767kBRipMuCu7lBfPA’, [1: timestamp, BIGINT, true] >= 1751161239103, [1: timestamp, BIGINT, true] <= 1751334039103\n partitionsRatio=45/46, tabletsRatio=135/135\n tabletList=45261,45265,45269,46064,46068,46072,49827,49831,49835,53914 …\n actualRows=53960, avgRowSize=32.0\n cardinality: 22\n probe runtime filters:\n - filter_id = 0, probe_expr = (<slot 1> 1: timestamp)\n column statistics: \n * timestamp–>[1.751333943E12, 1.751334039103E12, 0.0, 8.0, 21.57037385260145] ESTIMATE\n * dn–>[-Infinity, Infinity, 0.0, 24.0, 1.0] ESTIMATE\n] com.starrocks.rpc.RpcException: rpc failed with odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local: exec rpc error. backend [id=97199]
at com.starrocks.qe.DefaultCoordinator.handleErrorExecution(DefaultCoordinator.java:607)
at com.starrocks.qe.scheduler.Deployer.waitForDeploymentCompletion(Deployer.java:244)
at com.starrocks.qe.scheduler.Deployer.deployFragments(Deployer.java:116)
at com.starrocks.qe.DefaultCoordinator.deliverExecFragments(DefaultCoordinator.java:589)
at com.starrocks.qe.DefaultCoordinator.startScheduling(DefaultCoordinator.java:502)
at com.starrocks.qe.scheduler.Coordinator.startScheduling(Coordinator.java:102)
at com.starrocks.qe.scheduler.Coordinator.exec(Coordinator.java:85)
at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1132)
at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:634)
at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:346)
at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:542)
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:850)
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:70)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.ExecutionException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:275 timeout with bound channel =>[id: 0xbaa63cc9, L:/20.20.156.146:33578 - R:odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local/20.20.65.156:28243]
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:578)
at com.starrocks.qe.scheduler.dag.FragmentInstanceExecState.waitForDeploymentCompletion(FragmentInstanceExecState.java:268)
at com.starrocks.qe.scheduler.Deployer.waitForDeploymentCompletion(Deployer.java:225)
... 14 more
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:275 timeout with bound channel =>[id: 0xbaa63cc9, L:/20.20.156.146:33578 - R:odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local/20.20.65.156:28243]
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.doWaitCallback(ProtobufRpcProxy.java:651)
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.access$0(ProtobufRpcProxy.java:611)
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:576)
at com.starrocks.qe.scheduler.dag.FragmentInstanceExecState.waitForDeploymentCompletion(FragmentInstanceExecState.java:268)
at com.starrocks.qe.scheduler.Deployer.waitForDeploymentCompletion(Deployer.java:225)

odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local/20.20.65.156

After the pod restarted, did that pod's IP address change?

Yes, the IP changed.

Any ideas? At first we suspected a Kubernetes problem, but after power-on we manually resolved the domain name odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local and it returned the new IP, so DNS looks fine. So we still suspect StarRocks is caching something somewhere, and that repeated connection timeouts eventually trigger a refresh.
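The manual resolution check described above can be reproduced from the FE host's JVM. This is a hypothetical helper (the `ResolveCheck` class is not part of StarRocks); it assumes the BE FQDN from this thread and only demonstrates that the name resolver itself returns the current pod IP, which points the suspicion away from DNS and toward a cached connection:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    // Resolve a host and return its textual IP, or null if unresolvable.
    // Note the JVM also keeps its own positive DNS cache, governed by the
    // networkaddress.cache.ttl security property; a stale entry there
    // could make even this check return an old IP for a while.
    public static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null; // host not (or not yet) resolvable
        }
    }

    public static void main(String[] args) {
        // Hostname taken from the logs in this thread
        String host = args.length > 0 ? args[0]
                : "odaeqebeservice-0.odaeqebeservice-svc.sop.svc.cluster.local";
        System.out.println(host + " -> " + resolve(host));
    }
}
```

If this prints the new pod IP while the brpc channel in the logs still shows the old one, the stale state has to live above the resolver, in a cached connection.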


It's most likely brpc caching the connection.

Yes, but we haven't found the offending code yet :disappointed_relieved:

Look at the channel caching mechanism in ProtobufRpcProxy.
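For readers following along, a hedged sketch of the failure pattern being described (illustrative only, not the actual jprotobuf code; `ChannelCache` and its methods are made-up names): a client that caches an established channel keyed by host:port keeps sending to the IP resolved at connect time, so a pod IP change stays invisible until the cached entry is evicted and the next lookup reconnects and re-resolves the FQDN:

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative channel cache keyed by "host:port". Once a connection is
// established, the resolved peer address is frozen inside the cached
// entry; DNS moving the hostname to a new pod IP changes nothing for
// callers until the entry is removed.
public class ChannelCache {
    static class Channel {
        final InetSocketAddress peer; // resolved once, at connect time
        Channel(String host, int port) {
            this.peer = new InetSocketAddress(host, port); // DNS lookup happens here
        }
    }

    private final Map<String, Channel> cache = new ConcurrentHashMap<>();

    public Channel get(String host, int port) {
        // A cached channel is reused even if DNS has since changed:
        // this is the stale-IP behavior seen in the logs above.
        return cache.computeIfAbsent(host + ":" + port,
                k -> new Channel(host, port));
    }

    // Fix direction discussed in this thread: evict the entry when an
    // RPC against it times out, so the next call reconnects and
    // re-resolves the FQDN to the new pod IP.
    public void evict(String host, int port) {
        cache.remove(host + ":" + port);
    }
}
```

Eviction on timeout would also explain the "recovers after 20-odd minutes" symptom: once the stale entry is finally dropped by whatever path exists today, the next connection picks up the new IP.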

Thanks! We had changed that spot before without effect. Today we went through the jprotobuf code again, and after the modification it verified OK.

It looks like the baidu.jprotobuf framework is no longer maintained? :joy:

It's basically no longer updated; we're considering maintaining and updating it within StarRocks ourselves.

You could work out a fix and open a PR to them; if nobody there handles it, you can open one against our fork: https://github.com/StarRocks/Jprotobuf-rpc-socket

Sounds good.