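To help us locate your problem faster, please provide the following information. Thank you.
[Details] While stopping the FE service on a follower in a StarRocks cluster, the fe.pid file was deleted but the process did not exit, and its ports are still listening. SHOW FRONTENDS reports alive=false for this node.
[Background] The whole FE cluster was being stopped. It is a 3-node cluster, and only this one follower had the problem.
[Business impact]
[Shared-data mode?] No, shared-nothing
[StarRocks version] e.g. 3.2.13
[Cluster size] e.g. 3 FE (1 leader + 2 followers) + 3 BE
[Machine info] CPU vCores / memory / NIC, e.g. 32C/128G/10 GbE
[Contact] So that we can reach you promptly for logs while working on the issue, please add your contact details, e.g. community group 4 - Xiao Li, or an email address. Thank you.
[Attachments]
- fe.log
The FE log shows the following errors: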
[2025-07-08 21:46:47.412+0800][INFO ][0][thrift-server-pool-134668][ROOT][com.starrocks.transaction.DatabaseTransactionMgr][467] transaction:[TransactionState. txn_id: 2533, label: 9f9bceed-8eb5-44f1-9bb7-1b345b8e1013, db id: 10004, table id list: 10665, callback id: -1, coordinator: BE: xxx, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1751982403649, write end time: 1751982407349, allow commit time: -1, commit time: 1751982407349, finish time: -1, write cost: 3700ms, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@1a3b4105 tabletCommitInfos size: 6] successfully committed
[2025-07-08 21:46:48.261+0800][INFO ][0][Thread-69][][com.starrocks.StarRocksFE][515] FE shutdown
[2025-07-08 21:46:48.277+0800][INFO ][0][colocate group clone checker][ROOT][com.starrocks.clone.ColocateTableBalancer][935] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
[2025-07-08 21:46:48.328+0800][INFO ][0][AutoStatistic][ROOT][com.starrocks.qe.scheduler.QueryRuntimeProfile][218] unfinished instances: [fdaafe99-5c01-11f0-89d9-286ed48968ae, fdaafe99-5c01-11f0-89d9-286ed48968ac]
[2025-07-08 21:46:48.434+0800][INFO ][0][AutoStatistic][ROOT][com.starrocks.statistic.StatisticExecutor][375] statistic execute query | QueryId [fdaafe99-5c01-11f0-89d9-286ed48968aa] | SQL: SELECT xxx FROM (selectinteger_07
as column_key fromxxx
.、xxx
partitionxxx
) tt
[2025-07-08 21:46:48.474+0800][INFO ][0][PUBLISH_VERSION][ROOT][com.starrocks.transaction.DatabaseTransactionMgr][1109] finish transaction TransactionState. txn_id: 2533, label: 9f9bceed-8eb5-44f1-9bb7-1b345b8e1013, db id: 10004, table id list: 10665, callback id: -1, coordinator: BE: 76.3.11.6, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1751982403649, write end time: 1751982407349, allow commit time: -1, commit time: 1751982407349, finish time: 1751982408221, write cost: 3700ms, wait for publish cost: 63ms, publish rpc cost: 805ms, finish txn cost: 4ms, publish total cost: 872ms, total cost: 4572ms, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@1a3b4105 tabletCommitInfos size: 6 successfully
[2025-07-08 21:46:49.138+0800][WARN ][0][AutoStatistic][ROOT][com.starrocks.qe.QeProcessorImpl][124] monitor expired, query id = e4ed60dd-5c01-11f0-89d9-286ed48968aa
[2025-07-08 21:46:49.220+0800][INFO ][0][stats-cache-refresher-6][ROOT][com.starrocks.qe.scheduler.QueryRuntimeProfile][218] unfinished instances: [fdf8a96c-5c01-11f0-89d9-286ed48968ae]
[2025-07-08 21:46:49.220+0800][INFO ][0][stats-cache-refresher-6][ROOT][com.starrocks.statistic.StatisticExecutor][375] statistic execute query | QueryId [fdf8a96c-5c01-11f0-89d9-286ed48968aa] | SQL: SELECT xxx
[2025-07-08 21:46:49.646+0800][WARN ][0][heartbeat-mgr-pool-4][][com.starrocks.common.util.Util][390] failed to get result from url: https://xxx:2xxx/api/bootstrap?. Couldn't kickstart handshaking
[2025-07-08 21:46:49.649+0800][WARN ][0][heartbeat mgr][ROOT][com.starrocks.system.HeartbeatMgr][167] get bad heartbeat response: type: FRONTEND, status: BAD, msg: got exception, name: xxxx, queryPort: 0, rpcPort: 0, replayedJournalId: 0, feStartTime: \N, feVersion: null
[2025-07-08 21:46:49.651+0800][WARN ][0][heartbeat-mgr-pool-5][][com.starrocks.common.util.Util][390] failed to get result from url: https://xxx:xxx/api/bootstrap?. Connection refused (Connection refused)
[2025-07-08 21:46:49.653+0800][WARN ][0][heartbeat mgr][ROOT][com.starrocks.system.HeartbeatMgr][167] get bad heartbeat response: type: FRONTEND, status: BAD, msg: got exception, name: xxx.xxx, queryPort: 0, rpcPort: 0, replayedJournalId: 0, feStartTime: \N, feVersion: null
[2025-07-08 21:46:49.722+0800][INFO ][0][analyze-task-concurrency-pool-1][ROOT][com.starrocks.transaction.DatabaseTransactionMgr][318] begin transaction: txn_id: 2536 with label insert_ff46d450-5c01-11f0-89d9-286ed48968aa from coordinator FE: 7xxx, listner id: -1
[2025-07-08 21:46:50.299+0800][ERROR][0][thrift-server-pool-134611][ROOT][com.starrocks.common.SRTThreadPoolServer$WorkerProcess][319] Thrift Error occurred during processing of message. org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125)
... 9 more
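The log excerpt shows entries still being written after the "FE shutdown" line. To measure how long logging continued on a full fe.log, something like the following Python sketch might help. It assumes every entry starts with the `[timestamp][LEVEL]` prefix seen above; the function name `entries_after_shutdown` and the abridged sample lines are only for illustration.

```python
import re
from datetime import datetime

# Leading "[timestamp+zone][LEVEL]" of an fe.log entry, as seen in the excerpt.
ENTRY = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})[+-]\d{4}\]\[(\w+)\s*\]"
)

def entries_after_shutdown(lines, marker="FE shutdown"):
    """Return (seconds_since_marker, level) for each entry logged after the marker."""
    shutdown_at = None
    result = []
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # continuation line, e.g. a stack frame
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if shutdown_at is None:
            if marker in line:
                shutdown_at = ts
            continue
        result.append(((ts - shutdown_at).total_seconds(), m.group(2)))
    return result

# Three entries copied (abridged) from the excerpt above.
sample = [
    "[2025-07-08 21:46:48.261+0800][INFO ][0][Thread-69][][com.starrocks.StarRocksFE][515] FE shutdown",
    "[2025-07-08 21:46:49.138+0800][WARN ][0][AutoStatistic][ROOT][com.starrocks.qe.QeProcessorImpl][124] monitor expired",
    "[2025-07-08 21:46:50.299+0800][ERROR][0][thrift-server-pool-134611][ROOT][com.starrocks.common.SRTThreadPoolServer$WorkerProcess][319] Thrift Error occurred during processing of message.",
]
after = entries_after_shutdown(sample)
for seconds, level in after:
    print(f"{seconds:.3f}s after shutdown: {level}")
```

On the real file, pass `open("fe.log", encoding="utf-8")` instead of `sample`. The offset of the last entry shows when the process stopped logging.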
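- At this point the process is in state Sl (in ps terms: interruptible sleep, multi-threaded).
- After the problem appeared, the process wrote nothing further to its log or to standard output. Even after running kill -3 pid, nothing appeared in the stdout .out file.
- Running jstack -F pid gives the stack dump below; a large number of threads are in the BLOCKED state.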
Attaching to process ID 31961, please wait…
Debugger attached successfully.
Server compiler detected.
JVM version is 25.452-b09
Deadlock Detection:
No deadlocks found.
Thread 274068: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Interpreted frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=93 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=86 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=97 (Compiled frame)
- sun.nio.ch.SelectorImpl.select() @bci=2, line=101 (Compiled frame)
- io.netty.channel.nio.SelectedSelectionKeySetSelector.select() @bci=11, line=68 (Compiled frame)
- io.netty.channel.nio.NioEventLoop.select(long) @bci=12, line=879 (Compiled frame)
- io.netty.channel.nio.NioEventLoop.run() @bci=115, line=526 (Compiled frame)
- io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=997 (Interpreted frame)
- io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
- io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 274035: (state = BLOCKED)
- com.sun.org.apache.xerces.internal.dom.ParentNode.getLength() @bci=0, line=725 (Interpreted frame)
- org.apache.logging.log4j.core.config.xml.XmlConfiguration.constructHierarchy(org.apache.logging.log4j.core.config.Node, org.w3c.dom.Element) @bci=36, line=276 (Interpreted frame)
- org.apache.logging.log4j.core.config.xml.XmlConfiguration.constructHierarchy(org.apache.logging.log4j.core.config.Node, org.w3c.dom.Element) @bci=108, line=283 (Interpreted frame)
- org.apache.logging.log4j.core.config.xml.XmlConfiguration.constructHierarchy(org.apache.logging.log4j.core.config.Node, org.w3c.dom.Element) @bci=108, line=283 (Interpreted frame)
- org.apache.logging.log4j.core.config.xml.XmlConfiguration.setup() @bci=27, line=246 (Interpreted frame)
- org.apache.logging.log4j.core.config.AbstractConfiguration.initialize() @bci=221, line=256 (Interpreted frame)
- org.apache.logging.log4j.core.config.AbstractConfiguration.start() @bci=14, line=304 (Interpreted frame)
- org.apache.logging.log4j.core.LoggerContext.setConfiguration(org.apache.logging.log4j.core.config.Configuration) @bci=115, line=621 (Interpreted frame)
- org.apache.logging.log4j.core.LoggerContext.onChange(org.apache.logging.log4j.core.config.Reconfigurable) @bci=39, line=757 (Interpreted frame)
- org.apache.logging.log4j.core.util.AbstractWatcher$ReconfigurationRunnable.run() @bci=8, line=93 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 274032: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Interpreted frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=93 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=86 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=97 (Compiled frame)
- sun.nio.ch.SelectorImpl.select() @bci=2, line=101 (Compiled frame)
- io.netty.channel.nio.SelectedSelectionKeySetSelector.select() @bci=11, line=68 (Compiled frame)
- io.netty.channel.nio.NioEventLoop.select(long) @bci=12, line=879 (Compiled frame)
- io.netty.channel.nio.NioEventLoop.run() @bci=115, line=526 (Compiled frame)
- io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=997 (Interpreted frame)
- io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
- io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 273378: (state = IN_VM)
- java.lang.Shutdown.halt0(int) @bci=0 (Interpreted frame)
- java.lang.Shutdown.halt(int) @bci=7, line=156 (Interpreted frame)
- java.lang.Shutdown.exit(int) @bci=42, line=179 (Interpreted frame)
- java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=52 (Interpreted frame)
- sun.misc.Signal$1.run() @bci=8, line=212 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 268879: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=78, line=2096 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=134, line=1073 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1134 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 268329: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=160, line=460 (Compiled frame)
- java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Object, boolean, long) @bci=102, line=362 (Compiled frame)
- java.util.concurrent.SynchronousQueue.poll(long, java.util.concurrent.TimeUnit) @bci=11, line=941 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=134, line=1073 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1134 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 267283: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.ForkJoinPool.awaitWork(java.util.concurrent.ForkJoinPool$WorkQueue, int) @bci=350, line=1824 (Compiled frame)
- java.util.concurrent.ForkJoinPool.runWorker(java.util.concurrent.ForkJoinPool$WorkQueue) @bci=44, line=1693 (Compiled frame)
- java.util.concurrent.ForkJoinWorkerThread.run() @bci=24, line=175 (Compiled frame)
Thread 267282: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.ForkJoinPool.awaitWork(java.util.concurrent.ForkJoinPool$WorkQueue, int) @bci=350, line=1824 (Compiled frame)
- java.util.concurrent.ForkJoinPool.runWorker(java.util.concurrent.ForkJoinPool$WorkQueue) @bci=44, line=1693 (Compiled frame)
- java.util.concurrent.ForkJoinWorkerThread.run() @bci=24, line=175 (Compiled frame)
Thread 265192: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=78, line=2096 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=134, line=1073 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1134 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 264665: (state = BLOCKED)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
- java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, int, int) @bci=8, line=116 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int, int) @bci=117, line=171 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 (Compiled frame)
- java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)
- java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=286 (Compiled frame)
- java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=345 (Compiled frame)
- org.apache.thrift.transport.TIOStreamTransport.read(byte[], int, int) @bci=25, line=125 (Compiled frame)
- org.apache.thrift.transport.TTransport.readAll(byte[], int, int) @bci=22, line=86 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readAll(byte[], int, int) @bci=7, line=455 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readI32() @bci=52, line=354 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin() @bci=1, line=243 (Compiled frame)
- org.apache.thrift.TBaseProcessor.process(org.apache.thrift.protocol.TProtocol, org.apache.thrift.protocol.TProtocol) @bci=1, line=27 (Compiled frame)
- com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run() @bci=154, line=311 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 261053: (state = BLOCKED)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
- java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, int, int) @bci=8, line=116 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int, int) @bci=117, line=171 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 (Compiled frame)
- java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)
- java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=286 (Compiled frame)
- java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=345 (Compiled frame)
- org.apache.thrift.transport.TIOStreamTransport.read(byte[], int, int) @bci=25, line=125 (Compiled frame)
- org.apache.thrift.transport.TTransport.readAll(byte[], int, int) @bci=22, line=86 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readAll(byte[], int, int) @bci=7, line=455 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readI32() @bci=52, line=354 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin() @bci=1, line=243 (Compiled frame)
- org.apache.thrift.TBaseProcessor.process(org.apache.thrift.protocol.TProtocol, org.apache.thrift.protocol.TProtocol) @bci=1, line=27 (Compiled frame)
- com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run() @bci=154, line=311 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Thread 261052: (state = BLOCKED)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
- java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, int, int) @bci=8, line=116 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int, int) @bci=117, line=171 (Compiled frame)
- java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 (Compiled frame)
- java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)
- java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=286 (Compiled frame)
- java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=345 (Compiled frame)
- org.apache.thrift.transport.TIOStreamTransport.read(byte[], int, int) @bci=25, line=125 (Compiled frame)
- org.apache.thrift.transport.TTransport.readAll(byte[], int, int) @bci=22, line=86 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readAll(byte[], int, int) @bci=7, line=455 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readI32() @bci=52, line=354 (Compiled frame)
- org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin() @bci=1, line=243 (Compiled frame)
- org.apache.thrift.TBaseProcessor.process(org.apache.thrift.protocol.TProtocol, org.apache.thrift.protocol.TProtocol) @bci=1, line=27 (Compiled frame)
- com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run() @bci=154, line=311 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
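A dump like this is easier to read once threads are grouped by state and top frame. The Python sketch below does that and also lists threads that are inside java.lang.Shutdown. It assumes the `Thread N: (state = X)` and `- method(args) ...` line shapes seen above; the helper names are illustrative. One caveat, as far as I understand it: in jstack -F output BLOCKED is the JVM-internal thread state, so idle pool threads parked in Unsafe.park or waiting in socketRead0 are also reported as BLOCKED. A high BLOCKED count is therefore not by itself a sign of lock contention.

```python
import re
from collections import Counter

THREAD = re.compile(r"^Thread (\d+): \(state = (\w+)\)")
FRAME = re.compile(r"^\s*-\s+([\w.$<>]+)\(")

def parse_threads(dump):
    """Split `jstack -F` text into (thread_id, state, [method names]) tuples."""
    threads = []
    for line in dump.splitlines():
        m = THREAD.match(line)
        if m:
            threads.append((m.group(1), m.group(2), []))
            continue
        f = FRAME.match(line)
        if f and threads:
            threads[-1][2].append(f.group(1))
    return threads

def summarize(threads):
    """Count threads per (state, top frame)."""
    return Counter(
        (state, frames[0] if frames else "<no frames>")
        for _, state, frames in threads
    )

def in_shutdown(threads):
    """IDs of threads with a java.lang.Shutdown frame anywhere on the stack."""
    return [
        tid for tid, _, frames in threads
        if any(f.startswith("java.lang.Shutdown.") for f in frames)
    ]

# A few frames copied from the dump above.
sample = """\
Thread 273378: (state = IN_VM)
 - java.lang.Shutdown.halt0(int) @bci=0 (Interpreted frame)
 - java.lang.Shutdown.halt(int) @bci=7, line=156 (Interpreted frame)
Thread 268879: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
Thread 267283: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
"""
threads = parse_threads(sample)
summary = summarize(threads)
shutdown_ids = in_shutdown(threads)
for (state, top), count in summary.most_common():
    print(count, state, top)
print("threads in Shutdown:", shutdown_ids)
```

In the dump above, thread 273378 is the one inside Shutdown.halt0, reached from the signal handler (Terminator$1.handle). Thread 274035 is in a log4j reconfiguration (LoggerContext.onChange). Those two look like better starting points than the parked pool threads.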
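- Checking the FE's port connections shows that the connections on query_port and rpc_port are still there.
tcpdump shows that these connections remain healthy, and a TCP keepalive probe is still sent on them every 2 hours.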
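The 2-hour interval matches the Linux default TCP keepalive idle time (net.ipv4.tcp_keepalive_time = 7200 seconds). The Python sketch below works through the timing. It assumes the hosts run Linux with the default keepalive sysctls (7200 s idle, 75 s probe interval, 9 probes), which the post does not confirm; the function name is illustrative.

```python
def keepalive_timeline(idle_s=7200, interval_s=75, probes=9):
    """Return (first_probe_s, give_up_s) for an idle connection with SO_KEEPALIVE.

    first_probe_s: idle seconds before the first keepalive probe is sent.
    give_up_s:     roughly when the connection is dropped if no probe is answered.
    Defaults are the stock Linux sysctl values (an assumption about these hosts).
    """
    first_probe = idle_s
    give_up = idle_s + interval_s * probes
    return first_probe, give_up

first, give_up = keepalive_timeline()
print(f"first probe after {first / 3600:.1f} h idle")
print(f"unanswered connection dropped after about {give_up} s")
```

Keepalive probes are answered by the peer's kernel, not by the application. So while the hung FE process still holds its sockets, each probe is acknowledged and the peers have no reason to close the connections. That would be consistent with the connections on query_port and rpc_port staying open.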