为了更快的定位您的问题,请提供以下信息,谢谢
【详述】 Flink starrocks stream JOB found restarting due to Connection fe timed out
【背景】 通过FlinkSQL发布流式任务,将kafka数据加载到starrocks (环境K8S)
【业务影响】
【是否存算分离】否
【StarRocks版本】3.1.3
StarRocks operator 1.8.6
flink 1.17
flink-connector-starrocks-1.2.8_flink-1.17.jar
【集群规模】例如:3fe(2 follower 1 leader)+ 3be
【机器信息】CPU虚拟核/内存/网卡 fe: 4c10g be: 12c24g
【联系方式】邮箱:461050448@qq.com
【日志】 fe日志:报用户自动回滚事务。
2023-11-23 20:14:34
java.io.IOException: Could not perform checkpoint 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126.
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1256)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:107)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:750)
Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:285)
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115)
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125)
at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1043)
at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:951)
at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:934)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:934)
... 3 more
Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:349)
at com.starrocks.data.load.stream.DefaultStreamLoader.lambda$send$3(DefaultStreamLoader.java:172)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at com.starrocks.streamload.shade.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:304)
... 7 more
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at com.starrocks.streamload.shade.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 17 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126. Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:336)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:228)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:213)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:350)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1299)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1287)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1244)
... 15 more
Caused by: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.snapshotState(StarRocksDynamicSinkFunctionV2.java:298)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:88)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:222)
... 26 more
Caused by: [CIRCULAR REFERENCE: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265]
fe.waring.log
-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,045 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,467 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,589 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,466 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,467 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,589 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,468 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,590 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,468 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,590 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,019 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,020 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,469 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,591 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,003 WARN (heartbeat-mgr-pool-1|129) [HeartbeatMgr$BackendHeartbeatHandler.call():315] backend heartbeat got exception, addr: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local:9050
org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:144) ~[starrocks-fe.jar:?]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:129) ~[starrocks-fe.jar:?]
at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:265) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:251) ~[starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:229) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.13.0.jar:0.13.0]
... 13 more
2023-11-22 11:06:23,004 WARN (heartbeat mgr|13) [HeartbeatMgr.runAfterCatalogReady():164] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,469 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,591 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local