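【Details】After the cluster has been running for a while, SQL statements hang: they do not return a result even after the timeout, and the processlist shows statements that have been running for several hours or longer. CPU on the cluster machines is essentially idle. While the problem was occurring, we captured jstack dumps from the FE. Judging from the jstack output, the issue may be related to a deadlock.
【Background】The cluster recovers after the FE is restarted.
【Business impact】The cluster is unavailable.
【Shared-data (storage-compute separation)】No
【StarRocks version】e.g. 3.3.5
【Cluster size】e.g. 3 FE + 3 BE
【Machine info】CPU virtual cores / memory / NIC, e.g. 4C/32G/10 Gigabit
【Contact】Community group 13, 峻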
"thrift-server-pool-3489096" #5310339 daemon prio=5 os_prio=0 cpu=0.52ms elapsed=10.15s tid=0x0000eeae7c232000 nid=0x26dddd runnable [0x0000eead8d7fe000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(java.base@11.0.25/Native Method)
    at java.net.SocketInputStream.socketRead(java.base@11.0.25/SocketInputStream.java:115)
    at java.net.SocketInputStream.read(java.base@11.0.25/SocketInputStream.java:168)
    at java.net.SocketInputStream.read(java.base@11.0.25/SocketInputStream.java:140)
    at java.io.BufferedInputStream.fill(java.base@11.0.25/BufferedInputStream.java:252)
    at java.io.BufferedInputStream.read1(java.base@11.0.25/BufferedInputStream.java:292)
    at java.io.BufferedInputStream.read(java.base@11.0.25/BufferedInputStream.java:351)
    - locked <0x000000069a5b9a78> (a java.io.BufferedInputStream)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:170)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:100)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:519)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:387)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:271)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.25/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.25/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.25/Thread.java:829)
【Attachments】
jstack.log.1 (378.5 KB) jstack.log.2 (367.0 KB)
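For anyone going through dumps like the ones attached, a rough way to check the deadlock suspicion is to count thread states and list which locks have waiters and who (if anyone) is recorded as holding them. The script below is only a minimal sketch, not a StarRocks tool: the function name `summarize` is made up for illustration, and it assumes the usual HotSpot jstack text layout (`- locked <addr>`, `- waiting to lock <addr>`, `- parking to wait for <addr>`, and the `- <addr>` entries printed under "Locked ownable synchronizers" when the dump is taken with `jstack -l`). Owners of `java.util.concurrent` locks only show up in that last section, so without `-l` the owner may be reported as `None`.

```python
import re
from collections import Counter, defaultdict

# Patterns for the usual HotSpot jstack text layout (an assumption of this sketch).
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
# "- locked <addr>" (monitors) or "- <addr>" (locked ownable synchronizers, jstack -l)
OWNED = re.compile(r'^- (?:locked )?<(?P<addr>0x[0-9a-f]+)>')
# "- waiting to lock <addr>" (monitors) or "- parking to wait for <addr>" (j.u.c. locks)
WANTED = re.compile(r'^- (?:waiting to lock|parking to wait for)\s+<(?P<addr>0x[0-9a-f]+)>')


def summarize(dump: str):
    """Return (state counts, {lock address: (owner or None, [waiting threads])})."""
    states = Counter()
    owners = {}                   # lock address -> thread recorded as holding it
    waiters = defaultdict(list)   # lock address -> threads waiting for it
    current = None
    for raw in dump.splitlines():
        # Forum software sometimes turns straight quotes into curly ones; undo that.
        line = raw.strip().replace("\u201c", '"').replace("\u201d", '"')
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        if current is None:
            continue
        state = STATE.search(line)
        if state:
            states[state.group("state")] += 1
            continue
        owned = OWNED.match(line)
        if owned:
            owners[owned.group("addr")] = current
            continue
        wanted = WANTED.match(line)
        if wanted:
            waiters[wanted.group("addr")].append(current)
    contended = {addr: (owners.get(addr), names) for addr, names in waiters.items()}
    return states, contended
```

It could be run as `states, contended = summarize(open("jstack.log.1").read())`. A lock with many waiters whose owner is itself waiting on another lock is worth a closer look. A thread like the `thrift-server-pool` one quoted above, which is RUNNABLE in `socketRead0`, is just waiting for the next request on its socket and would not appear as a waiter.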
Upgrade to a version that includes the fix from this PR.