Long-running query fails; FE cannot reach BE and reports "SocketTimeoutException: Read timed out", but the BE looks normal in its logs

【Description】The query is issued via httpclient; the SQL is:
select * from table where timeId >= 'xxx' and timeId < 'yyy' order by column1 limit 1000
【Background】The table queried is a wide table with 380 columns, roughly 6 billion rows and 9 TB of data.
【Business impact】The query fails after running for a long time. The FE log gives the cause as "Cancel query [2e20e1bd-321a-11ef-b190-286ed4905829], because some related backend is not alive", but the corresponding BE node's log shows no related errors.
【Shared-data or shared-nothing】Shared-nothing (storage and compute coupled).
【StarRocks version】3.2.6
【Cluster size】3 FE (1 follower + 2 observers) + 9 BE (FE and BE co-located)
【Machine specs】Each machine: 32 cores / 136 GB RAM / 10 GbE NIC / 1.5 TB disk for StarRocks
【Contact】
【Attachments】

  • fe.log / be.INFO / related screenshots
2024-06-24 19:33:57.504+08:00 WARN (heartbeat-mgr-pool-3|141) [HeartbeatMgr$BackendHeartbeatHandler.call():325] backend heartbeat got exception, addr: 76.65.27.20:9050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:283) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:260) ~[starrocks-fe.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_382]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_382]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
        ... 13 more
2024-06-24 19:33:57.505+08:00 WARN (heartbeat mgr|26) [HeartbeatMgr.runAfterCatalogReady():166] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketTimeoutException: Read timed out
2024-06-24 19:34:02.520+08:00 WARN (heartbeat-mgr-pool-4|142) [HeartbeatMgr$BackendHeartbeatHandler.call():325] backend heartbeat got exception, addr: 76.65.27.20:9050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:283) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:260) ~[starrocks-fe.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_382]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_382]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
        ... 13 more
2024-06-24 19:34:02.521+08:00 WARN (heartbeat mgr|26) [HeartbeatMgr.runAfterCatalogReady():166] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketTimeoutException: Read timed out
2024-06-24 19:34:02.680+08:00 INFO (tablet scheduler|46) [ClusterLoadStatistic.classifyBackendByLoad():163] classify backend by load. medium: HDD, avg load score: 1.0057171045652251, low/mid/high: 0/9/0
2024-06-24 19:34:02.681+08:00 INFO (tablet scheduler|46) [TabletScheduler.updateClusterLoadStatistic():484] update cluster load statistic:
    {"beId":23495,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":22417056800,"mediums":[{"medium":"HDD","replica":430,"used":1189313450577,"total":"1.4TB","score":0.9674601428499784},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23495,"path":"/dbdata/hdfs/starrocks","pathHash":-911221383916378900,"storageMedium":"HDD","total":1581055930961,"used":1189313450577}]}
    {"beId":23489,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":23713332576,"mediums":[{"medium":"HDD","replica":434,"used":1199676189667,"total":"1.4TB","score":0.9715428010530057},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23489,"path":"/dbdata/hdfs/starrocks","pathHash":-6347946421233345028,"storageMedium":"HDD","total":1586567633891,"used":1199676189667}]}
    {"beId":23493,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":23398619264,"mediums":[{"medium":"HDD","replica":431,"used":1195971714606,"total":"1.4TB","score":0.974530696854858},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23493,"path":"/dbdata/hdfs/starrocks","pathHash":5387200276665703296,"storageMedium":"HDD","total":1576109014574,"used":1195971714606}]}
    {"beId":23490,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":23011640832,"mediums":[{"medium":"HDD","replica":435,"used":1197950818072,"total":"1.4TB","score":0.9751407862225633},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23490,"path":"/dbdata/hdfs/starrocks","pathHash":2669929127906401838,"storageMedium":"HDD","total":1577193056024,"used":1197950818072}]}
    {"beId":23492,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":23766838336,"mediums":[{"medium":"HDD","replica":435,"used":1201273132516,"total":"1.4TB","score":0.979292759654694},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23492,"path":"/dbdata/hdfs/starrocks","pathHash":1523735257628543257,"storageMedium":"HDD","total":1573644592612,"used":1201273132516}]}
    {"beId":23491,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":23679839840,"mediums":[{"medium":"HDD","replica":442,"used":1219597649906,"total":"1.4TB","score":0.986342000340271},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23491,"path":"/dbdata/hdfs/starrocks","pathHash":-615129396628447825,"storageMedium":"HDD","total":1583185278962,"used":1219597649906}]}
    {"beId":23494,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":20310327968,"mediums":[{"medium":"HDD","replica":320,"used":865581422539,"total":"983.9GB","score":1.0542333234625618},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23494,"path":"/dbdata/hdfs/starrocks","pathHash":-4050302138593708488,"storageMedium":"HDD","total":1056535685067,"used":865581422539}]}
    {"beId":23487,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":19829017552,"mediums":[{"medium":"HDD","replica":323,"used":873363533123,"total":"982.3GB","score":1.06552414916916},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23487,"path":"/dbdata/hdfs/starrocks","pathHash":5928258772759042216,"storageMedium":"HDD","total":1054738357571,"used":873363533123}]}
    {"beId":23488,"clusterName":"default_cluster","isAvailable":true,"cpuCores":32,"memLimit":38654705664,"memUsed":20284573992,"mediums":[{"medium":"HDD","replica":325,"used":881494411568,"total":"980.5GB","score":1.0773872814799335},{"medium":"SSD","replica":0,"used":0,"total":"0B","score":NaN}],"paths":[{"beId":23488,"path":"/dbdata/hdfs/starrocks","pathHash":68600672985287524,"storageMedium":"HDD","total":1052835940656,"used":881494411568}]}

2024-06-24 19:34:07.219+08:00 INFO (colocate group clone checker|111) [ColocateTableBalancer.matchGroups():903] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2024-06-24 19:34:07.261+08:00 INFO (AutoStatistic|38) [StatisticAutoCollector.runAfterCatalogReady():70] auto collect full statistic on all databases start
2024-06-24 19:34:07.262+08:00 INFO (AutoStatistic|38) [StatisticAutoCollector.runAfterCatalogReady():88] auto collect full statistic on all databases end
2024-06-24 19:34:07.532+08:00 WARN (heartbeat-mgr-pool-6|144) [HeartbeatMgr$BackendHeartbeatHandler.call():325] backend heartbeat got exception, addr: 76.65.27.20:9050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:283) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:260) ~[starrocks-fe.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_382]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_382]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_382]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_382]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
        ... 13 more
2024-06-24 19:34:07.533+08:00 WARN (heartbeat mgr|26) [HeartbeatMgr.runAfterCatalogReady():166] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketTimeoutException: Read timed out
2024-06-24 19:34:07.533+08:00 INFO (heartbeat mgr|26) [ComputeNode.handleHbResponse():549] Backend [id=23493, host=76.65.27.20, heartbeatPort=9050, alive=false] is dead due to exceed heartbeatRetryTimes
2024-06-24 19:34:07.533+08:00 INFO (heartbeat mgr|26) [CoordinatorMonitor.addDeadBackend():57] add backend 23493 to dead backend queue
2024-06-24 19:34:07.535+08:00 WARN (Thread-55|101) [CoordinatorMonitor$DeadBackendAndComputeNodeChecker.run():91] Cancel query [2e20e1bd-321a-11ef-b190-286ed4905829], because some related backend is not alive
2024-06-24 19:34:07.535+08:00 WARN (Thread-55|101) [DefaultCoordinator.cancel():809] cancel execState of query, this is outside invoke
2024-06-24 19:34:07.540+08:00 INFO (Thread-55|101) [QueryRuntimeProfile.finishAllInstances():212] unfinished instances: [2e20e1bd-321a-11ef-b190-286ed490582e, 2e20e1bd-321a-11ef-b190-286ed4905831, 2e20e1bd-321a-11ef-b190-286ed4905830, 2e20e1bd-321a-11ef-b190-286ed4905833, 2e20e1bd-321a-11ef-b190-286ed4905832, 2e20e1bd-321a-11ef-b190-286ed490582b, 2e20e1bd-321a-11ef-b190-286ed490582a, 2e20e1bd-321a-11ef-b190-286ed490582d, 2e20e1bd-321a-11ef-b190-286ed490582c, 2e20e1bd-321a-11ef-b190-286ed490582f]
2024-06-24 19:34:07.540+08:00 INFO (Thread-55|101) [DefaultCoordinator.cancel():817] count down profileDoneSignal since backend has crashed, query id: 2e20e1bd-321a-11ef-b190-286ed4905829
2024-06-24 19:34:07.554+08:00 WARN (starrocks-http-nio-pool-0|310) [DefaultCoordinator.getNext():739] get next fail, need cancel. status errorCode CANCELLED InternalError, query id: 2e20e1bd-321a-11ef-b190-286ed4905829
2024-06-24 19:34:07.554+08:00 WARN (thrift-server-pool-25|252) [DefaultCoordinator.updateFragmentExecStatus():898] exec state report failed status=errorCode CANCELLED InternalError, query_id=2e20e1bd-321a-11ef-b190-286ed4905829, instance_id=2e20e1bd-321a-11ef-b190-286ed490582a, backend_id=23488
2024-06-24 19:34:07.554+08:00 WARN (starrocks-http-nio-pool-0|310) [DefaultCoordinator.getNext():765] query failed: Backend node not found. Check if any backend node is down.
2024-06-24 19:34:07.588+08:00 INFO (starrocks-http-nio-pool-0|310) [StmtExecutor.execute():698] execute Exception, sql: select * from tbl_data_res_2080_15 where timeId>='2024-06-18 00:00:00' and timeId<'2024-06-19 00:00:00' order by r2080_i10005 desc limit 1000, error: Backend node not found. Check if any backend node is down.
2024-06-24 19:34:07.589+08:00 INFO (starrocks-http-nio-pool-0|310) [QueryRuntimeProfile.finishAllInstances():212] unfinished instances: [2e20e1bd-321a-11ef-b190-286ed490582e, 2e20e1bd-321a-11ef-b190-286ed4905831, 2e20e1bd-321a-11ef-b190-286ed4905830, 2e20e1bd-321a-11ef-b190-286ed4905833, 2e20e1bd-321a-11ef-b190-286ed4905832, 2e20e1bd-321a-11ef-b190-286ed490582b, 2e20e1bd-321a-11ef-b190-286ed490582d, 2e20e1bd-321a-11ef-b190-286ed490582c, 2e20e1bd-321a-11ef-b190-286ed490582f]
2024-06-24 19:34:07.589+08:00 INFO (starrocks-http-nio-pool-0|310) [DefaultCoordinator.cancel():817] count down profileDoneSignal since backend has crashed, query id: 2e20e1bd-321a-11ef-b190-286ed4905829
2024-06-24 19:34:07.589+08:00 INFO (starrocks-http-nio-pool-0|310) [ResourceGroupMetricMgr.createQueryResourceGroupMetrics():123] Add query_resource_group_err metric, resource group name is default_wg
2024-06-24 19:34:07.591+08:00 INFO (nioEventLoopGroup-8-3|309) [ExecuteSqlAction.handleChannelInactive():250] Netty channel is closed
2024-06-24 19:34:07.591+08:00 INFO (nioEventLoopGroup-8-3|309) [ConnectScheduler.unregisterConnection():155] Connection closed. remote=, connectionId=1
2024-06-24 19:34:07.601+08:00 INFO (profile-worker-0|253) [QeProcessorImpl.unregisterQuery():150] deregister query id = 2e20e1bd-321a-11ef-b190-286ed4905829
2024-06-24 19:34:07.719+08:00 INFO (thrift-server-pool-77|379) [QeProcessorImpl.reportExecStatus():191] ReportExecStatus() failed, query does not exist, fragment_instance_id=2e20e1bd-321a-11ef-b190-286ed490582b, query_id=2e20e1bd-321a-11ef-b190-286ed4905829,
2024-06-24 19:34:07.721+08:00 INFO (thrift-server-pool-33|265) [QeProcessorImpl.reportExecStatus():191] ReportExecStatus() failed, query does not exist, fragment_instance_id=2e20e1bd-321a-11ef-b190-286ed4905833, query_id=2e20e1bd-321a-11ef-b190-286ed4905829,
2024-06-24 19:34:07.744+08:00 INFO (tablet checker|47) [TabletChecker.doCheck():426] finished to check tablets. isUrgent: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2024-06-24 19:34:07.756+08:00 INFO (thrift-server-pool-49|283) [QeProcessorImpl.reportExecStatus():191] ReportExecStatus() failed, query does not exist, fragment_instance_id=2e20e1bd-321a-11ef-b190-286ed490582d, query_id=2e20e1bd-321a-11ef-b190-286ed4905829,
2024-06-24 19:34:07.757+08:00 INFO (tablet checker|47) [TabletChecker.doCheck():426] finished to check tablets. isUrgent: false, unhealthy/total/added/in_sched/not_ready: 8/2618/0/0/0, cost: 12 ms, in lock time: 12 ms, wait time: 0ms

  • Slow query:
    • Profile info
    • Parallelism: 2
    • Pipeline enabled: true
    • Screenshots of BE CPU and memory usage
      BE nodes averaged about 4 CPU cores and about 23 GB of memory; disk I/O was saturated at 100% utilization
  • Query error:
  • BE crash
    • be.out
      No errors seen in be.out
start time: Wed Jun 5 06:17:00 PM CST 2024
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Fri Jun 14 11:28:55 AM CST 2024
start time: Tue Jun 18 09:28:14 AM CST 2024
start time: Tue Jun 18 12:02:16 PM CST 2024
start time: Tue Jun 18 03:43:25 PM CST 2024
start time: Tue Jun 18 07:35:56 PM CST 2024
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Thu Jun 20 07:57:28 PM CST 2024
start time: Fri Jun 21 02:13:06 PM CST 2024
start time: Fri Jun 21 03:51:46 PM CST 2024
start time: Fri Jun 21 04:06:25 PM CST 2024
start time: Fri Jun 21 05:17:27 PM CST 2024
Ignored unknown config: storage_page_cache
start time: Fri Jun 21 05:40:05 PM CST 2024
Ignored unknown config: storage_page_cache
start time: Fri Jun 21 06:00:11 PM CST 2024
Ignored unknown config: storage_page_cache
start time: Mon Jun 24 07:07:17 PM CST 2024
Ignored unknown config: storage_page_cache

  • External table query errors
    • be.out and fe.warn.log
      fe.warn.log:
      fe.warn.log (23.5 KB)

Please share the output of show backends and the fe.gc.log.

FE process:

cat /proc/$fe_pid/limits | grep -E 'Max processes|Max open files'

BE process:

cat /proc/$be_pid/limits | grep -E 'Max processes|Max open files'

Please share the results.

show backends:

BackendId,IP,HeartbeatPort,BePort,HttpPort,BrpcPort,LastStartTime,LastHeartbeat,Alive,SystemDecommissioned,ClusterDecommissioned,TabletNum,DataUsedCapacity,AvailCapacity,TotalCapacity,UsedPct,MaxDiskUsedPct,ErrMsg,Version,Status,DataTotalCapacity,DataUsedPct,CpuCores,NumRunningQueries,MemUsedPct,CpuUsedPct
23487,76.65.27.13,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,327,825.587 GB,163.862 GB,1.028 TB,84.43 %,84.43 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",989.449 GB,83.44 %,32,0,30.86 %,3.9 %
23488,76.65.27.14,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,328,830.282 GB,158.091 GB,1.028 TB,84.98 %,84.98 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",988.372 GB,84.00 %,32,0,32.58 %,5.5 %
23489,76.65.27.15,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,438,1.105 TB,353.445 GB,1.537 TB,77.54 %,77.54 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",1.450 TB,76.20 %,32,0,32.58 %,5.0 %
23490,76.65.27.16,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,439,1.104 TB,355.057 GB,1.537 TB,77.44 %,77.44 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",1.450 TB,76.09 %,32,0,32.48 %,12.2 %
23491,76.65.27.17,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,446,1.123 TB,333.857 GB,1.537 TB,78.79 %,78.79 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:26""}",1.449 TB,77.50 %,32,0,31.98 %,0.3 %
23492,76.65.27.18,9050,9060,8040,8060,2024-06-24 20:14:07,2024-06-24 20:32:02,true,false,false,438,1.105 TB,355.510 GB,1.537 TB,77.41 %,77.41 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:33""}",1.452 TB,76.09 %,32,0,31.64 %,2.2 %
23493,76.65.27.20,9050,9060,8040,8060,2024-06-24 20:15:07,2024-06-24 20:32:02,true,false,false,434,1.100 TB,358.568 GB,1.537 TB,77.22 %,77.22 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:33""}",1.450 TB,75.86 %,32,0,30.51 %,1.4 %
23494,76.65.27.21,9050,9060,8040,8060,2024-06-24 20:14:04,2024-06-24 20:32:02,true,false,false,325,823.316 GB,165.926 GB,1.027 TB,84.23 %,84.23 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",989.241 GB,83.23 %,32,0,30.48 %,0.2 %
23495,76.65.27.22,9050,9060,8040,8060,2024-06-24 20:14:09,2024-06-24 20:32:02,true,false,false,434,1.096 TB,363.100 GB,1.537 TB,76.93 %,76.93 %,,3.2.6-2585333,"{""lastSuccessReportTabletsTime"":""2024-06-24 20:31:08""}",1.450 TB,75.55 %,32,0,30.76 %,0.2 %

FE process:

cat /proc/$fe_pid/limits | grep -E 'Max processes|Max open files'

[root@Big-Data-03 DatabaseQuery]# cat /proc/211981/limits|grep -E 'Max processes|Max open files'
Max processes             240000               240000               processes
Max open files            240000               240000               files

BE process:

cat /proc/$be_pid/limits | grep -E 'Max processes|Max open files'

[root@Big-Data-07 starrocks]# cat /proc/199208/limits|grep -E 'Max processes|Max open files'
Max processes             240000               240000               processes
Max open files            240000               240000               files

Please share resource usage monitoring from the time the problem occurred.

Sorry for the late reply. We have since found that the issue reproduces reliably.
We captured resource usage monitoring at two failure times: node- and process-level usage collected directly with netdata, and StarRocks metrics collected with Prometheus.
Since I am not sure exactly which monitoring you need, I took screenshots of as much as possible, so there are quite a few images.

StarRocks resource usage & overall BE resource usage (collected with netdata) (failure time: 2024-06-24 19:34)

StarRocks resource usage (collected with Prometheus) (failure time: 2024-06-25 18:05)

Some additional information that may help with diagnosis.

We have noticed that disk I/O utilization is often very high for long stretches, staying above 90% most of the time:

iotop shows the I/O comes mainly from starrocks_be [compact_pool]:

iostat disk monitoring:

No queries were running at the time; the only background activity was a routine load averaging about 10,000 rows/s.
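
One way to confirm that compaction is lagging behind the routine load (a sketch; SHOW TABLET is standard StarRocks syntax, and the table name is taken from this thread) is to look at per-tablet version counts, since a persistently high VersionCount usually means compaction cannot keep up with ingestion:

-- List tablets and their replica state; in the output, a high or
-- steadily growing VersionCount indicates a compaction backlog, and
-- the CompactionStatus/MetaUrl columns link to the owning BE's
-- per-tablet compaction detail page.
show tablet from tbl_data_res_2080_15;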

Hi. With 3 FEs, we recommend running 3 followers. Please also run show tablet from tbl_data_res_2080_15; to check how many tablets the table currently has and how much data each tablet holds. What types of load jobs run on the cluster, and are many of them concurrent? And each node currently has a single disk, correct?

Hi, sorry, the data in our environment has been cleared; I can only get the tablet details after reloading data, so it may be a few days before I can provide them.
1. Regarding followers: the current setup is 3 FEs, with 1 leader and 2 followers. I will also try adding one more follower to see whether it helps.

2. The cluster runs a single load job, into this same table, using Routine Load from Kafka with concurrency set to 9, i.e. one task per node.
The job properties are as follows:

{"partitions":"*","partial_update":"false","columnToColumnExpr":"*","maxBatchIntervalS":"10","partial_update_mode":"null","whereExpr":"*","dataFormat":"json","timezone":"GMT+08:00","format":"json","log_rejected_record_num":"0","taskTimeoutSecond":"80","json_root":"","maxFilterRatio":"1.0","strict_mode":"false","jsonpaths":"","taskConsumeSecond":"20","desireTaskConcurrentNum":"9","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"9","maxBatchRows":"200000"}

3. Each node currently has only one disk allocated to StarRocks.

Some further information.
At the time I was able to work around the problem by changing the heartbeat parameters in fe.conf, i.e. lengthening the heartbeat response timeout and increasing the retry count:

heartbeat_timeout_second=15
heartbeat_retry_times=10
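
If these parameters are dynamic in your build (check with ADMIN SHOW FRONTEND CONFIG first), the same change can also be applied at runtime without restarting the FE; a sketch:

-- Inspect the current heartbeat settings.
ADMIN SHOW FRONTEND CONFIG LIKE 'heartbeat%';

-- Apply at runtime; also persist the values in fe.conf so they
-- survive an FE restart.
ADMIN SET FRONTEND CONFIG ("heartbeat_timeout_second" = "15");
ADMIN SET FRONTEND CONFIG ("heartbeat_retry_times" = "10");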

But this only works around the problem rather than fixing it.

Later testing showed that the problem occurs mainly under highly concurrent, large sort queries.
For example, with 10 concurrent queries each sorting 800 million rows with limit 1000, monitoring shows very high system load, peaking above 90%. At that point every operation and response on the machine is delayed. Our suspicion is that StarRocks squeezing CPU and disk I/O to their limits drives system load so high that other work, such as heartbeat handling, cannot be serviced in time; the heartbeats then time out and the FE concludes the BE is dead.
At low concurrency, e.g. a single query, the same SQL over the same data volume does not reproduce the problem, and system load stays around 40%.
We also suspect our machines being ARM rather than x86 may contribute: without the x86 instruction set extensions to accelerate queries, system load is higher.
My understanding is that in theory this can be contained with resource groups and query queues, e.g. as sketched below.
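
A sketch of that approach, assuming StarRocks 3.2 semantics (the group name, classifier user, and all limits are illustrative and would need tuning):

-- Cap heavy ad-hoc queries with a resource group so they cannot
-- saturate CPU on a 32-core node.
CREATE RESOURCE GROUP rg_big_query
TO (user = 'query_user')                 -- illustrative classifier
WITH (
    "cpu_core_limit" = "16",
    "mem_limit" = "40%",
    "concurrency_limit" = "4"
);

-- Queue SELECTs instead of admitting them all at once, and gate
-- admission on cluster resource usage.
SET GLOBAL enable_query_queue_select = true;
SET GLOBAL query_queue_concurrency_limit = 4;
SET GLOBAL query_queue_mem_used_pct_limit = 0.8;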


For tables with large data volumes, we suggest sizing the bucket count from the per-partition data volume. The default automatic bucketing policy creates relatively few buckets, so each bucket ends up holding a lot of data; aim for roughly 1 GB of data per bucket. If possible, also add more disks per node, since a single disk easily becomes a bottleneck.
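
As a rough sizing sketch (the example table and the 25 GB partition figure are hypothetical; only the ~1 GB-per-bucket rule comes from the advice above):

-- First check how large each partition actually is.
SHOW PARTITIONS FROM tbl_data_res_2080_15;

-- Then size buckets at roughly partition_size / 1 GB. For example,
-- a hypothetical ~25 GB daily partition suggests about 24 buckets
-- rather than the small automatic default.
CREATE TABLE example_wide_table (
    timeId   DATETIME NOT NULL,
    deviceId BIGINT   NOT NULL,
    metric1  DOUBLE
)
DUPLICATE KEY (timeId, deviceId)
PARTITION BY date_trunc('day', timeId)
DISTRIBUTED BY HASH(deviceId) BUCKETS 24;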