【详述】从2.3.8.升级到2.5.4/2.5.5,BE节点经常宕机
【背景】从2.3.8.升级到2.5.4/2.5.5,BE节点经常宕机
【StarRocks版本】2.5.5
【集群规模】3fe(3 follower)+5be
【联系方式】15623937986
【附件】
BE节点宕机日志:
I0522 12:58:57.051426 38675 daemon.cpp:188] Current memory statistics: process(21474469856), query_pool(0), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2227948024), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 12:58:57.991765 42447 heartbeat_server.cpp:63] get heartbeat from FE.host:10.22.13.234, port:19020, cluster id:461923451, counter:3193
I0522 12:59:08.927392 41741 load_channel_mgr.cpp:225] Memory consumption(bytes) limit=58287716204 current=9008 peak=254520240
I0522 12:59:10.077042 41931 fragment_executor.cpp:158] Prepare(): query_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc0 fragment_instance_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc9 backend_num=3
I0522 12:59:10.091571 42073 fragment_executor.cpp:158] Prepare(): query_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc0 fragment_instance_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc3 backend_num=8
W0522 12:59:10.206810 39722 exec_state_reporter.cpp:129] Retrying ReportExecStatus: No more data to read.
I0522 12:59:10.207813 39722 pipeline_driver_executor.cpp:240] [Driver] Succeed to report exec state: fragment_instance_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc3
W0522 12:59:10.284379 39722 exec_state_reporter.cpp:129] Retrying ReportExecStatus: No more data to read.
I0522 12:59:10.285063 39722 pipeline_driver_executor.cpp:240] [Driver] Succeed to report exec state: fragment_instance_id=63564b4a-f85d-11ed-ae84-d4f5ef304bc9
I0522 12:59:12.059954 38675 daemon.cpp:188] Current memory statistics: process(21481204480), query_pool(0), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2233944530), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 12:59:22.901392 38674 daemon.cpp:92] Released 1048576 bytes from column pool
I0522 12:59:27.068373 38675 daemon.cpp:188] Current memory statistics: process(21480107696), query_pool(0), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2232895954), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 12:59:32.967772 41723 tablet_manager.cpp:820] Report all 624884 tablets info
I0522 12:59:32.967943 41723 thrift_client.cpp:81] Error closing connection to: 10.22.13.234:19020, ignoring (write() send(): Broken pipe)
I0522 12:59:42.076748 38675 daemon.cpp:188] Current memory statistics: process(21479934304), query_pool(0), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2232895954), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 12:59:57.088109 38675 daemon.cpp:188] Current memory statistics: process(21479828704), query_pool(0), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2232895954), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 12:59:58.123301 42447 heartbeat_server.cpp:63] get heartbeat from FE.host:10.22.13.234, port:19020, cluster id:461923451, counter:3205
I0522 13:00:01.532366 42069 fragment_executor.cpp:158] Prepare(): query_id=82013efe-f85d-11ed-b8ab-d4f5ef301410 fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301411 backend_num=0
I0522 13:00:01.532939 42069 profile_report_worker.cpp:31] register_pipeline_load query_id=82013efe-f85d-11ed-b8ab-d4f5ef301410, fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301411
I0522 13:00:01.538980 42065 fragment_executor.cpp:158] Prepare(): query_id=82013efe-f85d-11ed-b8ab-d4f5ef301410 fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301414 backend_num=4
I0522 13:00:01.540382 42065 profile_report_worker.cpp:31] register_pipeline_load query_id=82013efe-f85d-11ed-b8ab-d4f5ef301410, fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301414
I0522 13:00:01.746116 39860 profile_report_worker.cpp:47] unregister_pipeline_load query_id=82013efe-f85d-11ed-b8ab-d4f5ef301410, fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301414
W0522 13:00:01.758311 39722 exec_state_reporter.cpp:129] Retrying ReportExecStatus: write() send(): Broken pipe
I0522 13:00:01.758358 39722 thrift_client.cpp:81] Error closing connection to: 10.22.13.234:19020, ignoring (write() send(): Broken pipe)
I0522 13:00:01.762933 39722 pipeline_driver_executor.cpp:240] [Driver] Succeed to report exec state: fragment_instance_id=82013efe-f85d-11ed-b8ab-d4f5ef301414
I0522 13:00:01.787835 40546 jdbc_driver_manager.cpp:150] driver[oraclejdbc_mir_online] already exists
I0522 13:00:03.131991 42447 heartbeat_server.cpp:80] Updating master info: TMasterInfo(network_address=TNetworkAddress(hostname=10.22.13.234, port=19020), cluster_id=461923451, epoch=80, token=5e31602b-3966-4c58-936f-9e2f94311948, backend_ip=10.22.13.237, http_port=18030, heartbeat_flags=0, backend_id=10381, min_active_txn_id=17334367)
I0522 13:00:08.927951 41963 local_tablets_channel.cpp:507] cancel LocalTabletsChannel txn_id: 17334366 load_id: 50dc359cf85911ed-ae84d4f5ef304bc0 index_id: 9227642 #tablet:3 tablet_ids:9227685,9227681,9227669
I0522 13:00:08.927979 41963 load_channel_mgr.cpp:220] Deleted timeout channel. load id=50dc359cf85911ed-ae84d4f5ef304bc0 timeout=1800
I0522 13:00:08.927989 41963 load_channel_mgr.cpp:225] Memory consumption(bytes) limit=58287716204 current=18016 peak=254520240
I0522 13:00:10.078866 41961 fragment_executor.cpp:158] Prepare(): query_id=87196a2f-f85d-11ed-b8ab-d4f5ef301410 fragment_instance_id=87196a2f-f85d-11ed-b8ab-d4f5ef301417 backend_num=1
I0522 13:00:10.092798 41954 fragment_executor.cpp:158] Prepare(): query_id=87196a2f-f85d-11ed-b8ab-d4f5ef301410 fragment_instance_id=87196a2f-f85d-11ed-b8ab-d4f5ef301413 backend_num=8
W0522 13:00:10.270107 39722 exec_state_reporter.cpp:129] Retrying ReportExecStatus: No more data to read.
I0522 13:00:10.270694 39722 pipeline_driver_executor.cpp:240] [Driver] Succeed to report exec state: fragment_instance_id=87196a2f-f85d-11ed-b8ab-d4f5ef301413
W0522 13:00:10.327071 39722 exec_state_reporter.cpp:129] Retrying ReportExecStatus: No more data to read.
I0522 13:00:10.327720 39722 pipeline_driver_executor.cpp:240] [Driver] Succeed to report exec state: fragment_instance_id=87196a2f-f85d-11ed-b8ab-d4f5ef301417
I0522 13:00:12.099540 38675 daemon.cpp:188] Current memory statistics: process(21720628816), query_pool(204260152), load(9008), metadata(1414599746), compaction(0), schema_change(0), column_pool(2262061796), page_cache(13661878384), update(504206), chunk_allocator(617479808), clone(0), consistency(0)
I0522 13:00:12.902689 38674 daemon.cpp:92] Released 8388608 bytes from column pool
I0522 13:01:02.403997 72716 daemon.cpp:277] version 2.5.5 RELEASE (build 24c1eca)
Built on 2023-04-27 09:02:38 by StarRocks@localhost
I0522 13:01:02.411518 72716 mem_info.cpp:91] Physical Memory: 251.32 GB
查看FE主节点日志,如下:
TStat num of tablet being scheduled: 8691 (+0)
TStat num of tablet being scheduled succeeded: 2596 (+0)
TStat num of tablet being scheduled failed: 6082 (+0)
TStat num of tablet being scheduled discard: 1 (+0)
TStat num of tablet priority upgraded: 0 (+0)
TStat num of tablet priority downgraded: 0 (+0)
TStat num of clone task: 2594 (+0)
TStat num of clone task succeeded: 2593 (+0)
TStat num of clone task failed: 0 (+0)
TStat num of clone task timeout: 0 (+0)
TStat num of replica missing error: 0 (+0)
TStat num of replica version missing error: 0 (+0)
TStat num of replica unavailable error: 0 (+0)
TStat num of replica redundant error: 4 (+0)
TStat num of replica missing in cluster error: 0 (+0)
TStat num of balance scheduled: 8687 (+0)
TStat num of colocate replica mismatch: 0 (+0)
TStat num of colocate replica redundant: 0 (+0)
TStat num of colocate balancer running round: 0 (+0)
2023-05-22 13:00:11,301 INFO (starrocks-mysql-nio I/O-3|109) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.234:36132
2023-05-22 13:00:16,214 INFO (starrocks-mysql-nio I/O-1|107) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.233:54027
2023-05-22 13:00:17,432 INFO (starrocks-mysql-nio I/O-4|110) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.234:36150
2023-05-22 13:00:18,119 INFO (colocate group clone checker|133) [ColocateTableBalancer.matchGroup():843] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2023-05-22 13:00:20,115 INFO (nioEventLoopGroup-5-1|249) [RestBaseAction.handleRequest():57] receive http request. url=/metrics
2023-05-22 13:00:23,271 WARN (heartbeat-mgr-pool-7|157) [HeartbeatMgr$BackendHeartbeatHandler.call():324] backend heartbeat got exception, addr: 10.22.13.237:19050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:284) [starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:262) [starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_181]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_181]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_181]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_181]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_181]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_181]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_181]
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
... 13 more
2023-05-22 13:00:23,272 WARN (heartbeat mgr|46) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketTimeoutException: Read timed out
2023-05-22 13:00:26,074 INFO (starrocks-mysql-nio I/O-4|110) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.233:54039
2023-05-22 13:00:27,023 INFO (tablet scheduler|47) [ClusterLoadStatistic.classifyBackendByLoad():151] classify backend by load. medium: HDD, avg load score: 1.0005433417077445, low/mid/high: 0/5/0
2023-05-22 13:00:27,432 INFO (starrocks-mysql-nio I/O-4|110) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.234:36194
2023-05-22 13:00:30,126 INFO (starrocks-mysql-nio I/O-3|109) [AcceptListener.handleEvent():57] Connection established. remote=/10.22.13.233:54049
2023-05-22 13:00:30,844 INFO (tablet checker|48) [TabletChecker.doCheck():396] finished to check tablets. checkInPrios: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms
2023-05-22 13:00:31,207 INFO (ReportHandler|112) [ReportHandler.tabletReport():337] backend[13649] reports 506223 tablet(s). report version: 16847298620000
2023-05-22 13:00:31,231 WARN (tablet scheduler|47) [TabletScheduler.runAfterCatalogReady():357] select balance tablets cost too much time: 4 seconds
2023-05-22 13:00:31,456 WARN (heartbeat-mgr-pool-0|150) [HeartbeatMgr$BackendHeartbeatHandler.call():324] backend heartbeat got exception, addr: 10.22.13.237:19050
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:284) [starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:262) [starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_181]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_181]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_181]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_181]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_181]
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
… 13 more
2023-05-22 13:00:31,457 WARN (pool-16-thread-8|256) [KafkaUtil$ProxyAPI.sendProxyRequest():192] failed to send proxy request to TNetworkAddress(hostname:10.22.13.237, port:18060) err A error occurred: errorCode=2001 errorMessage:Connection reset by peer
2023-05-22 13:00:31,457 WARN (pool-16-thread-2|244) [KafkaUtil$ProxyAPI.sendProxyRequest():192] failed to send proxy request to TNetworkAddress(hostname:10.22.13.237, port:18060) err A error occurred: errorCode=2001 errorMessage:Connection reset by peer
2023-05-22 13:00:31,457 WARN (heartbeat mgr|46) [HeartbeatMgr.runAfterCatalogReady():165] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketException: Connection reset
2023-05-22 13:00:31,457 WARN (pool-16-thread-2|244) [RoutineLoadTaskScheduler.scheduleOneTask():191] check task ready to execute failed
com.starrocks.common.LoadException: failed to send proxy request to TNetworkAddress(hostname:10.22.13.237, port:18060) err A error occurred: errorCode=2001 errorMessage:Connection reset by peer
at com.starrocks.common.util.KafkaUtil$ProxyAPI.sendProxyRequest(KafkaUtil.java:193) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.KafkaUtil$ProxyAPI.getOffsets(KafkaUtil.java:134) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.KafkaUtil$ProxyAPI.getLatestOffsets(KafkaUtil.java:114) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.KafkaUtil.getLatestOffsets(KafkaUtil.java:66) ~[starrocks-fe.jar:?]
at com.starrocks.load.routineload.KafkaTaskInfo.readyToExecute(KafkaTaskInfo.java:86) ~[starrocks-fe.jar:?]
at com.starrocks.load.routineload.RoutineLoadTaskScheduler.scheduleOneTask(RoutineLoadTaskScheduler.java:173) ~[starrocks-fe.jar:?]
at com.starrocks.load.routineload.RoutineLoadTaskScheduler.lambda$submitToSchedule$1(RoutineLoadTaskScheduler.java:152) ~[starrocks-fe.jar:?]