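【Details】After the FE Leader has been running for a while, its port 8030 becomes slow to respond.
For example, Flink times out when writing data to StarRocks (SR) via Stream Load.
The Grafana charts cannot fetch the Leader's information, and the cluster's Tablet count shows N/A.
Manually fetching the Leader's metrics (http://fe_leader_ip:8030/metrics) is also slow.
While the Leader's port 8030 is slow, the Followers' port 8030 responds normally; fetching the metrics over HTTP by hand from each node makes the difference obvious.
Restarting the FE Leader resolves the problems above, but after some time the Leader's port 8030 becomes slow again. So far the only workaround has been to restart the FE Leader.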
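The manual Leader/Follower comparison can be scripted roughly as below. This is only a minimal sketch: `fe_follower_ip` is a placeholder I made up alongside `fe_leader_ip`, the 5-second threshold is an arbitrary assumption, and the helper names `fetch_seconds` and `slow_nodes` are hypothetical, not part of StarRocks.

```python
import http.client
import time
import urllib.request
from typing import Dict, List, Optional


def fetch_seconds(url: str, timeout: float = 30.0) -> Optional[float]:
    """Return how long a full GET of `url` took, or None on error or timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include body transfer time, not just the headers
    except (OSError, http.client.HTTPException):
        return None
    return time.monotonic() - start


def slow_nodes(timings: Dict[str, Optional[float]], threshold: float) -> List[str]:
    """Names of nodes that failed or took longer than `threshold` seconds."""
    return sorted(n for n, t in timings.items() if t is None or t > threshold)


if __name__ == "__main__":
    # Placeholder hosts (assumptions): replace with the real FE addresses.
    nodes = {
        "leader": "http://fe_leader_ip:8030/metrics",
        "follower": "http://fe_follower_ip:8030/metrics",
    }
    timings = {name: fetch_seconds(url) for name, url in nodes.items()}
    for name, t in timings.items():
        print(name, "failed/timeout" if t is None else f"{t:.2f}s")
    # 5.0 s is an arbitrary cutoff chosen for illustration.
    print("slow:", slow_nodes(timings, threshold=5.0))
```

Running it periodically (for example from cron) and logging the output would show when the Leader starts to slow down relative to the Followers.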
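【Background】None
【Business impact】Affects writes from real-time jobs and viewing cluster monitoring in Grafana
【Shared-data (storage-compute separated) deployment】No
【StarRocks version】V2.5.20
【Cluster size】4 FE (3 Followers + 1 Observer) + 12 BE (FE and BE deployed on separate machines)
【Machine info】vCPUs / memory / NIC: 104C / 384G / 10 GbE
【Contact】Community group 9, 微笑人生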
【Attachments】
- Flink error log:
Error log from the real-time computing job:
Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Request load failed because http response code is 307 which means ‘Temporary Redirect’. This can happen when FE responds the request slowly , you should find the reason first. The reason may be StarRocks FE/Flink/Spark GC, network delay, or others. db: ads_ec, table: rt_o_ec_order_ec_oms_plat_order_record_analysis, label: flink-7d66a75b-fda3-4a60-82ed-f554f3bd16ba, response status line: HTTP/1.1 307 Temporary Redirect
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427) ~[flink-connector-starrocks-1.2.9_flink-1.18.jar:?]
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355) ~[flink-connector-starrocks-1.2.9_flink-1.18.jar:?]
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:251) ~[flink-connector-starrocks-1.2.9_flink-1.18.jar:?]
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41) ~[bd-flink-common-5.0.0.jar:?]
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1062) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255) ~[bd-flink-common-5.0.0.jar:?]
at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72) ~[bd-flink-common-5.0.0.jar:?]
at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[b
- Grafana cannot display the cluster's tablet information.
- No Full GC occurred on the FE Leader while it was slow.