2022-05-17, first incident: a Kafka connection problem makes SR (StarRocks) abnormal
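# Symptoms
SR suddenly became abnormal: connections timed out, data could not be written, and queries were very slow.
Symptom 1: connecting to SR took a very long time and still failed.
Symptom 2: the server services looked normal, but queries were very slow; even a trivial statement took more than ten seconds.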
show backends;
show broker;
show frontends;
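Symptom 3: the real-time sync jobs stayed in the running state but consumed no data.
Symptom 4: the DataX jobs could not sync normally.
Root cause analysis
1. The FE server was under almost zero load.
2. There were no large queries, and even with all jobs stopped, queries were still very slow.
3. Check the logs: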
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:1 timeout with bound channel =>[id: 0x89c901c8, L:/10.10.16.238:31976 - R:/10.10.16.238:8060]
2022-05-17 18:30:02,806 WARN (AutoStatistic|32) [Coordinator.exec():534] exec plan fragment failed, errmsg=exec rpc error. backend id: 10003, code: THRIFT_RPC_ERROR, fragmentId=F02, backend=10.10.16.238:9060
2022-05-17 18:30:02,870 INFO (AutoStatistic|32) [Coordinator.cancelInternal():807] unfinished instance: 2a64b440-d5cc-11ec-9c1f-0050568520c9
2022-05-17 18:30:02,871 WARN (AutoStatistic|32) [SimpleScheduler.addToBlacklist():143] add black list 10003
2022-05-17 18:30:02,871 WARN (AutoStatistic|32) [StatisticExecutor.queryExpireTableSync():301] Execute statistic table query fail.
2022-05-17 19:56:38,053 WARN (starrocks-mysql-nio-pool-3|165) [Coordinator.exec():519] catch a execute exception
java.util.concurrent.ExecutionException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:60 timeout with bound channel =>[id: 0x068f910d, L:/10.10.16.238:62584 - R:/10.10.16.239:8060]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:578) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.starrocks.qe.Coordinator.exec(Coordinator.java:512) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:682) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:379) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:268) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:415) ~[starrocks-fe.jar:?]
    at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:651) ~[starrocks-fe.jar:?]
    at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$417(ReadListener.java:55) ~[starrocks-fe.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:60 timeout with bound channel =>[id: 0x068f910d, L:/10.10.16.238:62584 - R:/10.10.16.239:8060]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.doWaitCallback(ProtobufRpcProxy.java:652) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.access$000(ProtobufRpcProxy.java:54) ~[jprotobuf-rpc-core-4.1.8.jar:?]
    at com.baidu.jprotobuf.pbr
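When the FE log is full of these timeouts, it can help to see which remote endpoints they point at. The following is only a rough sketch, not part of the original investigation: the function name `count_timeout_remotes` is made up for illustration, and it assumes each timeout message sits on a single line (wrapped messages like some of the ones pasted above would need to be joined first).

```python
import re
from collections import Counter

# Matches the remote side of the "bound channel" in a jprotobuf timeout
# message, e.g. "R:/10.10.16.239:8060". Whitespace inside the address is
# tolerated because copied logs sometimes contain stray spaces.
REMOTE_RE = re.compile(r"R:/\s*([0-9][0-9.\s]*?)\s*:\s*(\d+)")


def count_timeout_remotes(lines):
    """Count 'onceTalkTimeout' errors per remote host:port (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if "onceTalkTimeout" not in line:
            continue  # not an RPC timeout message
        match = REMOTE_RE.search(line)
        if match:
            host = re.sub(r"\s+", "", match.group(1))  # drop stray spaces
            counts[f"{host}:{match.group(2)}"] += 1
    return counts
```

Passing an open log file to `count_timeout_remotes` gives a quick view of whether the timeouts cluster on one BE or are spread across all of them.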
Check whether the network is saturated:
sar -n DEV 1
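Starting the real-time sync job produced a connection timeout; we also noticed that the "58" machine could not be reached.
We then tested whether the Kafka port was reachable: the "58" server could not be connected to.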
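The port test can be scripted so that every broker in the Routine Load configuration is checked from each SR host in one go. This is a minimal sketch using only the Python standard library; the helper names `can_connect` and `check_brokers` are invented here, and the broker addresses in the usage note are placeholders, not the real hosts from this incident.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_brokers(brokers, timeout=3.0):
    """Map each 'host:port' string to True/False reachability."""
    results = {}
    for broker in brokers:
        host, _, port = broker.rpartition(":")
        results[broker] = can_connect(host, int(port), timeout)
    return results
```

For example, `check_brokers(["kafka-a:9092", "kafka-b:9092", "kafka-c:9092"])` (hypothetical names) run on an SR host would show a `False` entry for a broker whose whitelist or firewall blocks that host. A successful TCP connect only proves the port is reachable, not that the broker itself is healthy.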

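The root cause was finally found: the Routine Load job was configured with three Kafka brokers, and one of them had never had the SR servers added to its whitelist. That broker was therefore unreachable, which in turn made SR abnormal.
Solution
1. Add the SR servers to the whitelist on the "58" machine.
2. Confirm that the Kafka hosts are configured on the SR side.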
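If "Kafka hosts" in step 2 refers to hostname resolution on the SR machines (for example `/etc/hosts` entries, which is an assumption on my part), a quick check could look like the sketch below. `resolve_hosts` is a hypothetical helper name.

```python
import socket


def resolve_hosts(hostnames):
    """Return {hostname: sorted list of resolved IPs, or None if unresolvable}."""
    resolved = {}
    for name in hostnames:
        try:
            infos = socket.getaddrinfo(name, None)
            # info[4] is the socket address tuple; its first item is the IP.
            resolved[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            resolved[name] = None
    return resolved
```

Running it with the broker hostnames on every FE and BE node makes it easy to spot a machine where one name is missing or maps to the wrong address.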
2022-06-08, second incident: SR problem
Symptoms
A large number of jobs on SR started failing again.
A test connection to the database also failed.

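Analysis
The log kept reporting: (Routine load scheduler|45) [KafkaRoutineLoadJob.unprotectNeedReschedule():292] ROUTINE_LOAD_JOB=19336, error_msg={Job failed to fetch all current partition with error [Failed to send proxy request: Ocurrs time out with specfied time 5 SECONDS]}.
With the experience from the first incident, we quickly suspected Kafka.
Checking the logs showed that one Kafka broker could not be reached.
Further investigation showed that this broker had run out of disk space.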
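A simple disk check on the Kafka hosts would have flagged this earlier. The following is a sketch under assumptions: the 90% threshold and the function names are illustrative choices, and the path to check would be whatever directory the broker's data lives in on your system.

```python
import shutil


def disk_usage_percent(path):
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path, threshold_percent=90.0):
    """True when usage is at or above the (assumed) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent
```

Calling `is_disk_nearly_full` on the Kafka data directory from a cron job or monitoring agent is one way to get a warning before the broker stops accepting connections.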
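Solution
After the disk on this Kafka broker was cleaned up, Kafka recovered and SR returned to normal with it.
We hope the community can fix this bug so that a Kafka problem no longer makes SR abnormal. Stability matters.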









