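【Description】StarRocks is deployed in shared-nothing mode with 3 FEs + 3 BEs. In a scenario where one node loses power around the time a three-replica table is created, the cluster becomes briefly unavailable when inserting into the table, and the error is `Execute plan fragment catch a exception, address=xxxx java.lang.RuntimeException: Unable to validate object`. It looks like there is no check on whether the backend is alive when the plan is dispatched.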
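【Business impact】The cluster is briefly unavailable
【Shared-data mode?】No, shared-nothing
【StarRocks version】3.2.13
【Cluster size】3 FEs (1 leader + 2 followers) + 3 BEs
【Detailed error message】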
java.lang.RuntimeException: Unable to validate object
at com.baidu.jprotobuf.pbrpc.transport.ChannelPool.getChannel(ChannelPool.java:86)
at com.baidu.jprotobuf.pbrpc.transport.RpcChannel.getConnection(RpcChannel.java:73)
at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.invoke(ProtobufRpcProxy.java:499)
at com.sun.proxy.$Proxy31.execPlanFragmentAsync(Unknown Source)
at com.starrocks.rpc.BackendServiceClient.sendPlanFragmentAsync(BackendServiceClient.java:86)
at com.starrocks.rpc.BackendServiceClient.execPlanFragmentAsync(BackendServiceClient.java:115)
at com.starrocks.qe.scheduler.dag.FragmentInstanceExecState.deployAsync(FragmentInstanceExecState.java:179)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at com.starrocks.qe.scheduler.Deployer.deployFragments(Deployer.java:112)
at com.starrocks.qe.DefaultCoordinator.deliverExecFragments(DefaultCoordinator.java:589)
at com.starrocks.qe.DefaultCoordinator.startScheduling(DefaultCoordinator.java:502)
at com.starrocks.qe.scheduler.Coordinator.startScheduling(Coordinator.java:102)
at com.starrocks.qe.scheduler.Coordinator.exec(Coordinator.java:85)
at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1107)
at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:619)
at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:346)
at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:542)
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:850)
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.NoSuchElementException: Unable to validate object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:506)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
at com.baidu.jprotobuf.pbrpc.transport.ChannelPool.getChannel(ChannelPool.java:80)
... 21 more
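There does not seem to be a good way to solve this inside the kernel itself. You can configure monitoring and alerting so the failure is detected promptly, and add retries on the client side.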
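
As a rough illustration of the client-side retry suggested above, here is a minimal sketch in Python. The helper names `run_with_retry` and `is_transient_deploy_error`, the attempt count, and the delays are all assumptions made for this example, not part of StarRocks or any driver. The error matching relies only on the message text quoted in this thread, and `operation` stands for whatever function your client uses to send the statement.

```python
import random
import time


def is_transient_deploy_error(exc):
    """Assumption: treat the error text reported in this thread as retryable."""
    text = str(exc)
    return "Unable to validate object" in text or "Execute plan fragment" in text


def run_with_retry(operation, is_retryable, max_attempts=5,
                   base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds, retrying only retryable errors.

    Waits with exponential backoff plus random jitter between attempts.
    The last error is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the final attempt or on errors we do not recognise.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    # Hypothetical stand-in for a real INSERT call through a MySQL-protocol client.
    state = {"calls": 0}

    def fake_insert():
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("Execute plan fragment catch a exception: "
                               "Unable to validate object")
        return "inserted"

    print(run_with_retry(fake_insert, is_transient_deploy_error, base_delay=0.01))
```

Note that blindly retrying an INSERT can write rows twice if an earlier attempt actually succeeded. If your setup allows it, giving each INSERT its own label (`INSERT INTO ... WITH LABEL ...`) may help detect a repeated load; please check the documentation for your version before relying on that.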