Stream Load fails: the Backend reports that its request to the FE timed out.

【Details】Detailed problem description
【Background】Stream Load writes fail intermittently; sometimes they even hit the retry limit of 3 and then an exception is thrown:

2023-02-27 13:43:45,292 WARN com.starrocks.connector.flink.manager.StarRocksSinkManager [] - Failed to flush batch data to StarRocks, retry times = 0
com.starrocks.connector.flink.manager.StarRocksStreamLoadFailedException: Failed to flush data to StarRocks, Error response:
{"Status":"Fail","BeginTxnTimeMs":101,"Message":"call frontend service failed, address=TNetworkAddress(hostname=10.5.140.21, port=9020), reason=THRIFT_EAGAIN (timed out)","NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,"Label":"f63a6304-c179-4296-b8ab-89ab8c8b311f","LoadBytes":0,"StreamLoadPlanTimeMs":0,"NumberTotalRows":0,"WriteDataTimeMs":0,"TxnId":448487,"LoadTimeMs":0,"ReadDataTimeMs":0,"NumberLoadedRows":0,"NumberFilteredRows":0}
{}

at com.starrocks.connector.flink.manager.StarRocksStreamLoadVisitor.doStreamLoad(StarRocksStreamLoadVisitor.java:104) ~[flink-connector-starrocks-1.2.1_flink-1.13_2.11.jar:?]
at com.starrocks.connector.flink.manager.StarRocksSinkManager.asyncFlush(StarRocksSinkManager.java:324) ~[flink-connector-starrocks-1.2.1_flink-1.13_2.11.jar:?]
at com.starrocks.connector.flink.manager.StarRocksSinkManager.lambda$startAsyncFlushing$0(StarRocksSinkManager.java:161) ~[flink-connector-starrocks-1.2.1_flink-1.13_2.11.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

The FE has 48 GB of memory allocated and is almost completely idle, yet it frequently reports timeouts.
【StarRocks version】StarRocks-2.5.1, downloaded on February 17.
【Cluster size】1 FE + 6 BE

Suggestion: deploy 3 FEs (1 LEADER + 2 FOLLOWERs) and spread the requests across the three FEs through a load balancer.
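If the cluster does move to three FEs, here is a minimal sketch of pointing the Flink connector at all of them. It assumes the same flink-connector-starrocks seen in the stack trace above; the hostnames, ports, database, table, and credentials are placeholders, and the StarRocksSinkOptions import path can differ slightly between connector versions. Both jdbc-url and load-url accept multiple FE addresses, so Stream Load requests are no longer pinned to a single FE:

// Sketch only: placeholder hosts/credentials; the StarRocksSinkOptions package
// path may vary between connector versions.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.starrocks.connector.flink.StarRocksSink;
import com.starrocks.connector.flink.table.sink.StarRocksSinkOptions;

public class MultiFeSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("{\"id\": 1, \"name\": \"demo\"}")   // rows serialized as JSON strings
           .addSink(StarRocksSink.sink(
               StarRocksSinkOptions.builder()
                   // multiple FEs for the MySQL protocol (query_port, default 9030)
                   .withProperty("jdbc-url", "jdbc:mysql://fe1:9030,fe2:9030,fe3:9030")
                   // multiple FEs for Stream Load over HTTP (http_port, default 8030)
                   .withProperty("load-url", "fe1:8030;fe2:8030;fe3:8030")
                   .withProperty("database-name", "demo_db")
                   .withProperty("table-name", "demo_tbl")
                   .withProperty("username", "root")
                   .withProperty("password", "")
                   .withProperty("sink.properties.format", "json")
                   .withProperty("sink.properties.strip_outer_array", "true")
                   // keep the existing 1-minute batching mentioned in this thread
                   .withProperty("sink.buffer-flush.interval-ms", "60000")
                   .build()));

        env.execute("starrocks-multi-fe-sink-sketch");
    }
}

As far as I can tell from the connector source, it probes the hosts listed in load-url and picks an available one for each flush, so a slow or unreachable FE no longer blocks every load.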

The problem right now is that the FE has almost no requests coming in, yet the sync job still fails with timeouts...

Don't look at the FE alone. Is the load on the BEs high?

It's basically an empty cluster with nothing but the Flink load job. The ingest volume is very low, about 200 rows/s, and the sink already batches for 1 minute. The BE monitoring also shows the BEs are idle.

For example, the error reported at 09:53 this morning:
2023-02-28 09:53:02,065 WARN com.flink.executor.connector.starrocks.manager.StarRocksSinkManager - Failed to flush batch data to StarRocks, retry times = 0
com.flink.executor.connector.starrocks.manager.StarRocksStreamLoadFailedException: Failed to flush data to StarRocks, Error response:
{"Status":"Fail","BeginTxnTimeMs":101,"Message":"call frontend service failed, address=TNetworkAddress(hostname=10.5.140.21, port=9020), reason=THRIFT_EAGAIN (timed out)","NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,"Label":"f3599e35-e65e-4727-a9b1-63433eb33803","LoadBytes":0,"StreamLoadPlanTimeMs":0,"NumberTotalRows":0,"WriteDataTimeMs":0,"TxnId":477656,"LoadTimeMs":0,"ReadDataTimeMs":0,"NumberLoadedRows":0,"NumberFilteredRows":0}

The corresponding cluster monitoring:

FE (monitoring screenshot): the odd part is that the FE's old-generation GC count keeps increasing, while another cluster that is actually in use does not show this behavior:

BE (monitoring screenshot):
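A possible way to tie the two observations together: if the FE old generation keeps growing until a Full GC fires, any stop-the-world pause longer than the BE's Thrift RPC timeout would show up exactly as an intermittent THRIFT_EAGAIN, even on an idle cluster. A hedged fe.conf sketch for turning on detailed GC logging on JDK 8 so the pause times can be checked against the error timestamps (the flag set and log path are illustrative, not the shipped defaults; append the flags to the JAVA_OPTS already present in fe.conf rather than replacing it):

# fe.conf -- illustrative JDK 8 GC-logging flags, merge with the existing JAVA_OPTS
JAVA_OPTS="-Xmx48g -Xloggc:/path/to/fe.gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"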
Try upgrading to 3.5.2 first and then test again.

I've run into the same problem as well, hoping for a solution!

thrift_connect_timeout_seconds=3
thrift_rpc_strict_mode=1
thrift_rpc_timeout_ms=10000
Is there any detailed documentation for these three RPC parameters? The descriptions are all very vague. For example, if a Stream Load job runs for 10 minutes, what is the thrift RPC actually doing during that time? Sometimes simply raising the timeout does avoid the problem to some extent, but it would be nice to understand why.
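As far as I understand it (a best-effort reading, not an authoritative one), these three are BE-side settings for the Thrift client the BE uses to call the FE on its RPC port (9020, which matches the address in the errors above). A 10-minute Stream Load is one long HTTP session between the client and the coordinator BE, but inside it the BE only makes a handful of short Thrift RPCs to the FE, such as beginning the transaction, fetching the load plan, and committing at the end. thrift_rpc_timeout_ms bounds each of those individual calls rather than the whole load, so THRIFT_EAGAIN on an otherwise tiny load usually means one of these control RPCs got no answer within the window, for example because the FE was stuck in a long GC pause. An illustrative be.conf sketch (values are examples only, and a BE restart may be needed for them to take effect):

# be.conf -- illustrative values, not recommendations
thrift_connect_timeout_seconds = 3    # time allowed to establish the connection to the FE Thrift port
thrift_rpc_timeout_ms = 30000         # timeout applied to each individual BE -> FE Thrift call
thrift_rpc_strict_mode = true         # exact semantics not well documented, which is part of the question above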