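2022-05-17: First incident, a Kafka connection problem caused SR (StarRocks) to malfunction
#Symptoms
SR suddenly started misbehaving: connections timed out, data could not be written, and queries were very slow.
Symptom 1: connecting to SR took a very long time and still failed.
Symptom 2: the server's services all looked normal, but queries were very slow; even a simple statement such as the following took more than ten seconds: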
show backends;
show broker;
show frontends;
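Symptom 3: the real-time sync jobs stayed in the running state but consumed no data.
Symptom 4: DataX jobs could not sync properly.
#Root cause analysis
1. The FE server was under almost zero load.
2. With no large queries and all jobs stopped, queries were still slow.
3. Checked the logs: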
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:1 timeout with bound channel =>[id: 0x89c901c8, L:/10.10.16.238:31976 - R:/10.10.16.238:8060]
2022-05-17 18:30:02,806 WARN (AutoStatistic|32) [Coordinator.exec():534] exec plan fragment failed, errmsg=exec rpc error. backend id: 10003, code: THRIFT_RPC_ERROR, fragmentId=F02, backend=10.10.16.238:9060
2022-05-17 18:30:02,870 INFO (AutoStatistic|32) [Coordinator.cancelInternal():807] unfinished instance: 2a64b440-d5cc-11ec-9c1f-0050568520c9
2022-05-17 18:30:02,871 WARN (AutoStatistic|32) [SimpleScheduler.addToBlacklist():143] add black list 10003
2022-05-17 18:30:02,871 WARN (AutoStatistic|32) [StatisticExecutor.queryExpireTableSync():301] Execute statistic table query fail.
2022-05-17 19:56:38,053 WARN (starrocks-mysql-nio-pool-3|165) [Coordinator.exec():519] catch a execute exception
java.util.concurrent.ExecutionException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:60 timeout with bound channel =>[id: 0x068f910d, L:/10.10.16.238:62584 - R:/10.10.16.239:8060]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:578) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.starrocks.qe.Coordinator.exec(Coordinator.java:512) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:682) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:379) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:268) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:415) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:651) ~[starrocks-fe.jar:?]
    at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$417(ReadListener.java:55) ~[starrocks-fe.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:60 timeout with bound channel =>[id: 0x068f910d, L:/10.10.16.238:62584 - R:/10.10.16.239:8060]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.doWaitCallback(ProtobufRpcProxy.java:652) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.access$000(ProtobufRpcProxy.java:54) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.baidu.jprotobuf.pbr
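Each timeout message in the FE log names the remote endpoint (the `R:/ip:port` part) that did not answer within the 60000 ms `onceTalkTimeout`. A minimal sketch for tallying those endpoints, so it is easier to see which nodes the FE is timing out against; the function name `timed_out_remotes` and the regular expression are illustrative assumptions based only on the message format seen in this log, not an official StarRocks tool:

```python
import re
from collections import Counter

# Matches the channel description in the jprotobuf timeout messages, e.g.
# "timeout with bound channel =>[id: 0x068f910d, L:/10.10.16.238:62584 - R:/10.10.16.239:8060]"
# The pattern is an assumption derived from the log excerpt in this post.
CHANNEL_RE = re.compile(
    r"timeout with bound channel\s*=>\s*\[\s*id:\s*(0x[0-9a-fA-F]+)\s*,"
    r"\s*L:/([\d.]+):(\d+)\s*-\s*R:/([\d.]+):(\d+)\s*\]"
)


def timed_out_remotes(log_text):
    """Count timeout messages per remote (ip, port) endpoint."""
    counts = Counter()
    for match in CHANNEL_RE.finditer(log_text):
        remote_ip, remote_port = match.group(4), int(match.group(5))
        counts[(remote_ip, remote_port)] += 1
    return counts


# The three timeout messages from the log excerpt, shortened to the relevant tail.
sample_log = "\n".join([
    "correlationId:1 timeout with bound channel =>[id: 0x89c901c8, "
    "L:/10.10.16.238:31976 - R:/10.10.16.238:8060]",
    "correlationId:60 timeout with bound channel =>[id: 0x068f910d, "
    "L:/10.10.16.238:62584 - R:/10.10.16.239:8060]",
    "correlationId:60 timeout with bound channel =>[id: 0x068f910d, "
    "L:/10.10.16.238:62584 - R:/10.10.16.239:8060]",
])

remote_counts = timed_out_remotes(sample_log)
for (ip, port), n in remote_counts.most_common():
    print(f"{ip}:{port} timed out {n} time(s)")
```

In this excerpt every timed-out endpoint is on port 8060 of a 10.10.16.x node, which suggests the RPC calls from the FE to the backends were hanging rather than the FE itself being overloaded.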
Check whether the network is saturated:
sar -n DEV 1
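After the real-time sync job was started, the connection timed out; we also noticed that the machine referred to as 58 could not be reached.
Testing whether the Kafka ports were reachable confirmed that the 58 server could not be connected to.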
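The port test can be scripted so that every broker in a Routine Load job is checked from every SR node. A minimal sketch using a plain TCP connect; the helper names `can_connect` and `unreachable_brokers` and the host names in the usage comment are hypothetical, and a successful TCP connect only shows the port is open, not that Kafka itself is healthy:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_brokers(brokers, timeout=3.0):
    """Return the (host, port) pairs from brokers that cannot be reached."""
    return [(host, port) for host, port in brokers
            if not can_connect(host, port, timeout)]


# Hypothetical usage; replace with the broker list from the Routine Load job:
#   brokers = [("kafka-1.example.internal", 9092),
#              ("kafka-2.example.internal", 9092),
#              ("kafka-3.example.internal", 9092)]
#   print(unreachable_brokers(brokers))
```

Running this from each FE and BE host matters, because a whitelist or firewall rule can block one source machine while allowing another.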
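The root cause was finally found: the Routine Load job was configured with three Kafka brokers, and one of them had never added the SR servers to its whitelist. That broker could not be reached, which in turn made SR misbehave.
#Solution
1. Add the SR servers to the whitelist on the 58 machine.
2. Confirm that the Kafka hosts are configured on the SR servers.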
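If "configuring the Kafka hosts" here means making the broker host names resolvable on the SR machines (for example through `/etc/hosts`), which is an assumption about the step above, a quick resolution check could look like this. The function name `unresolvable_hosts` and the example host names are hypothetical:

```python
import socket


def unresolvable_hosts(hostnames):
    """Return the host names that this machine cannot resolve to an address."""
    failed = []
    for name in hostnames:
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed


# Hypothetical usage on each SR node:
#   print(unresolvable_hosts(["kafka-1.example.internal",
#                             "kafka-2.example.internal",
#                             "kafka-3.example.internal"]))
```

An empty list means every name resolved; it says nothing about whether the ports are reachable, so it complements a port test rather than replacing it.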
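Second incident, 2022-06-08: SR problem
#Symptoms
SR again had a large number of failing jobs.
Testing the database connection also failed.
#Analysis
The log kept reporting: (Routine load scheduler|45) [KafkaRoutineLoadJob.unprotectNeedReschedule():292] ROUTINE_LOAD_JOB=19336, error_msg={Job failed to fetch all current partition with error [Failed to send proxy request: Ocurrs time out with specfied time 5 SECONDS]}
With the experience from the previous incident, we quickly suspected Kafka.
The logs showed that one of the Kafka brokers could not be reached.
Further investigation showed that this broker had run out of disk space.
#Solution
After disk space was freed on that broker, Kafka recovered, and SR recovered along with it.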
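Since the trigger this time was a full disk on one broker, a simple free-space check run periodically on each Kafka host could flag the problem before it reaches SR. A minimal sketch; the 10% threshold, the function names, and the example path are assumptions to adapt, not recommended values:

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """Return True if free space on path's filesystem is below min_free_fraction."""
    return free_fraction(path) < min_free_fraction


# Hypothetical usage on a Kafka host; the path should be the broker's data directory:
#   if is_low_on_space("/data/kafka-logs"):
#       print("Kafka data disk is nearly full")
```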
We hope the community can fix this bug so that a Kafka problem no longer makes SR malfunction. Stability matters.