Flink SQL writing to a StarRocks cluster on K8s: FE Proxy fails to forward Load requests

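【Details】A Flink job loads data into a StarRocks cluster deployed on K8s. nginx (the FE Proxy) intermittently fails to forward requests, which makes Flink checkpoints fail and causes Kafka consumption lag to jitter.
【Background】None
【Business impact】The Flink load transactions are rolled back, the job's checkpoints fail, and Kafka consumption lag jitters.
【Shared-data (storage-compute separation)】No
【StarRocks version】3.1.2
【Cluster size】3 FE + 3 BE
【Error messages】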
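1. fe.log
(FE received only the transaction's begin request and never the load request. After the Flink connector detected that the load request sent to FE had failed, it sent a rollback request, so the FE log contains only the begin and rollback entries.)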

2024-04-08 15:08:04,127 INFO (nioEventLoopGroup-4-1|88) [TransactionLoadAction.executeTransaction():298] redirect transaction action to destination=TNetworkAddress(hostname:nap-cluster-be-1.nap-cluster-be-search.nap.svc.cluster.local, port:8040), db: stat_prod_cn, table: real_time_details, op: begin, label: cb08ef26-9ac4-423e-90f7-02c2bfa44200
2024-04-08 15:09:04,203 INFO (nioEventLoopGroup-4-3|90) [TransactionLoadAction.executeTransaction():298] redirect transaction action to destination=TNetworkAddress(hostname:nap-cluster-be-1.nap-cluster-be-search.nap.svc.cluster.local, port:8040), db: stat_prod_cn, table: real_time_details, op: rollback, label: cb08ef26-9ac4-423e-90f7-02c2bfa44200
2024-04-08 15:09:04,209 INFO (thrift-server-pool-12389172|12889949) [DatabaseTransactionMgr.abortTransaction():1228] transaction:[TransactionState. txn_id: 4294181, label: cb08ef26-9ac4-423e-90f7-02c2bfa44200, db id: 590213, table id list: 654688, callback id: -1, coordinator: BE: be_ip, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1712560084127, commit time: -1, finish time: 1712560144206, total cost: 60079ms, reason: transaction is aborted by user. attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@175ca436] successfully rollback
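
To find other affected transactions in a larger fe.log, the begin/rollback pattern above can be searched for by label. The following is only a minimal sketch: the regular expression is written against the two "redirect transaction action" lines shown here, and it assumes (not confirmed by this log) that a load request reaching FE would be logged the same way with `op: load`. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp, transaction op, and label, as they appear in the
# "redirect transaction action" lines quoted above.
LINE_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?op: (\w+), label: ([0-9a-f-]+)"
)


def op_times_by_label(lines):
    """Return {label: {op: datetime}} for every line that carries an op and a label."""
    result = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # e.g. the abortTransaction line has no "op:" field
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        result.setdefault(m.group(3), {})[m.group(2)] = ts
    return result


def begin_to_rollback_seconds(lines):
    """Seconds from begin to rollback for labels that have both but no load entry."""
    gaps = {}
    for label, ops in op_times_by_label(lines).items():
        # Assumption: a load that reached FE would show up as "op: load".
        if "begin" in ops and "rollback" in ops and "load" not in ops:
            gaps[label] = (ops["rollback"] - ops["begin"]).total_seconds()
    return gaps
```

Usage would be something like `begin_to_rollback_seconds(open("fe.log", encoding="utf-8"))`. For the label above the gap is about 60 seconds, matching the `total cost: 60079ms` in the abort entry. That may line up with a 60-second connect timeout on the proxy (60s is nginx's default `proxy_connect_timeout`), but this is a guess and not something the logs prove.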

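2. FE Proxy (nginx) log
(The nginx log shows that Flink did send the load request, but nginx timed out with "Connection timed out" while connecting to FE to forward it, and returned HTTP status 504.)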

2024/04/08 15:09:04 [error] 23#23: *5149517 upstream timed out (110: Connection timed out) while connecting to upstream, client: client_ip, server: , request: "PUT /api/transaction/load HTTP/1.1", upstream: "http://fe_ip:8030/api/transaction/load", host: "cn-nap-starrocks.data.private.com:30842"

client_ip - nap_rw [08/Apr/2024:15:09:04 +0000] "PUT /api/transaction/load HTTP/1.1" 504 192 "-" "Apache-HttpClient/4.5.13 (Java/1.8.0_332)"
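
Since nginx reports the timeout "while connecting to upstream", one way to narrow this down is to check whether a plain TCP connection to the FE HTTP port is intermittently slow or failing from the same network location as the proxy. The following is a rough sketch using only the Python standard library. `probe_connect` is a made-up helper name, and the attempt count and timeout are arbitrary values, not recommendations.

```python
import socket
import time


def probe_connect(host, port, attempts=5, timeout=3.0, pause=0.0):
    """Open a TCP connection `attempts` times; return a list of (ok, seconds) tuples."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # Only the TCP handshake is measured; no HTTP request is sent.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            # Covers timeouts, refused connections, and DNS failures.
            ok = False
        results.append((ok, time.monotonic() - start))
        if pause:
            time.sleep(pause)
    return results
```

Running something like `probe_connect("fe_ip", 8030, attempts=100, pause=1.0)` from inside the nginx pod (with `fe_ip` replaced by the real FE address) and looking for `ok == False` entries or unusually long connect times would show whether the problem is at the network layer between the proxy and FE, as opposed to inside FE's HTTP handling.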

3. Flink job log
2024-04-08 15:09:04,223 ERROR com.starrocks.connector.flink.manager.StarRocksSinkManagerV2 - catch exception, wait rollback com.starrocks.data.load.stream.exception.StreamLoadFailException: Request load failed because http response code is not 200. db: stat_prod_cn, table: real_time_details, label: cb08ef26-9ac4-423e-90f7-02c2bfa44200, response status line: HTTP/1.1 504 Gateway Time-out
at com.starrocks.data.load.stream.DefaultStreamLoader.parseHttpResponse(DefaultStreamLoader.java:347)
at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:251)
at com.starrocks.data.load.stream.DefaultStreamLoader.lambda$send$2(DefaultStreamLoader.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

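Could someone help troubleshoot this? The StarRocks cluster is deployed on K8s and the Flink job writes data in real time through the FE Proxy, and this setup currently has stability problems.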

Is nobody in the community following this issue?
com.starrocks.data.load.stream.exception.StreamLoadFailException: Request abort transaction failed because http response code is not 200. I am running into the same problem.