Flink job 发现链接fe leader报timed out错误

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】 Flink starrocks stream JOB found restarting due to Connection fe timed out
【背景】 通过FlinkSQL发布流式任务,将kafka数据加载到starrocks (环境K8S)
【业务影响】
【是否存算分离】否
【StarRocks版本】3.1.3
StarRocks operator 1.8.6
flink 1.17
flink-connector-starrocks-1.2.8_flink-1.17.jar
【集群规模】例如:3fe(2 follower 1 leader)+ 3be
【机器信息】CPU虚拟核/内存/网卡 fe: 4c10g be: 12c24g
【联系方式】邮箱:461050448@qq.com
【日志】 fe日志:报用户自动回滚事务。

2023-11-23 20:14:34
java.io.IOException: Could not perform checkpoint 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126.
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1256)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:107)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118)
	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
	at java.lang.Thread.run(Thread.java:750)
	Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
		at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
		at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
		at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:285)
		at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
		at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115)
		at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163)
		at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125)
		at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1043)
		at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
		at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
		at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
		at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:951)
		at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:934)
		at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
		at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:934)
		... 3 more
	Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
		at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:349)
		at com.starrocks.data.load.stream.DefaultStreamLoader.lambda$send$3(DefaultStreamLoader.java:172)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		... 1 more
	Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
		at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:304)
		... 7 more
	Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
		at java.net.PlainSocketImpl.socketConnect(Native Method)
		at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
		at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
		at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
		at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
		at java.net.Socket.connect(Socket.java:607)
		at com.starrocks.streamload.shade.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
		... 17 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126. Failure reason: Checkpoint was declined.
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:336)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:228)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:213)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:350)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1299)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1287)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1244)
	... 15 more
Caused by: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
	at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
	at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
	at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.snapshotState(StarRocksDynamicSinkFunctionV2.java:298)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:88)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:222)
	... 26 more
Caused by: [CIRCULAR REFERENCE: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265]

fe.waring.log

-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,045 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,467 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,589 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,466 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,467 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,589 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,468 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,590 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,468 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,590 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,019 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,020 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,469 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,591 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,003 WARN (heartbeat-mgr-pool-1|129) [HeartbeatMgr$BackendHeartbeatHandler.call():315] backend heartbeat got exception, addr: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local:9050
org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
        at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:144) ~[starrocks-fe.jar:?]
        at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:129) ~[starrocks-fe.jar:?]
        at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
        at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:265) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:251) ~[starrocks-fe.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:229) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
        at java.net.Socket.connect(Socket.java:609) ~[?:?]
        at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.13.0.jar:0.13.0]
        ... 13 more
2023-11-22 11:06:23,004 WARN (heartbeat mgr|13) [HeartbeatMgr.runAfterCatalogReady():164] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,469 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,591 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local

集群状态正常吗?show frontends

是正常的。 我们通过抓包,发现请求发送到fe follower 重定义到leader请求,该重定义请求实际被没有被发送出去。 所以我们关闭了fe follower 问题就消失了。

通过设置k8s svc 会话保持半个小时,该现象消失了。

请教下,你修改的是fe还是fe proxy的service会话保持?我们也遇到了这个问题

fe的

be 的也要

好的,非常感谢,我这边修改Service会话保持之后,线上实时写入也恢复正常了 :+1::+1::+1: