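【Details】Detailed problem description
The user's SQL contains too many GROUPING SETS dimensions together with COUNT(DISTINCT) logic, which causes all CN nodes to crash and restart. The problem existed in version 3.2.14 and still occurs after upgrading to 3.5.12.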
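The original SQL is not attached to this report. As a rough sketch of why many grouping dimensions combined with COUNT(DISTINCT) can become expensive, the Python snippet below enumerates the grouping sets produced by a CUBE over a few dimensions and estimates the row amplification, assuming the plan replicates each input row once per grouping set. That replication model, the dimension names, and the row count are illustrative assumptions, not taken from the failing query.

```python
from itertools import combinations


def cube_grouping_sets(dims):
    """Return every grouping set that CUBE(dims) expands to (2**len(dims) sets)."""
    sets = []
    for k in range(len(dims), -1, -1):
        sets.extend(combinations(dims, k))
    return sets


def repeated_rows(input_rows, grouping_sets):
    """Rows to aggregate if each input row is replicated once per grouping set (assumed model)."""
    return input_rows * len(grouping_sets)


# Hypothetical dimension names; the real query's columns are not known.
dims = ["region", "channel", "device", "day"]
sets = cube_grouping_sets(dims)
print(len(sets))                        # number of grouping sets
print(repeated_rows(1_000_000, sets))   # rows fed to the distinct aggregation
```

Under this model, every extra dimension in a CUBE doubles the number of grouping sets, and each resulting group keeps its own distinct-value set.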
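【Background】What operations were performed?
The crash is triggered when executing complex queries that contain distinct aggregation (Distinct Aggregate) or joins on large wide tables. The queries read metadata of external Hive tables (visible in the HiveMetaClient logs).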
【Business impact】
The query fails. Under high concurrency, some nodes may crash repeatedly, which affects cluster stability.
【Shared-data (storage-compute separation)?】
Yes
【StarRocks version】
3.5.12 (build a2e4b58 distro ubuntu arch x86_64)
【Cluster size】
- FE: 3 nodes
- CN: 48 nodes
【Machine information】
- CPU: 32C
- Memory: 200G
- NIC: 10 GbE
【Attachments】
(Note: the following are key excerpts from the logs; attach the complete files when submitting.)
1. CN crash log (Crash Stack Trace):
3.5.12 RELEASE (build a2e4b58 distro ubuntu arch x86_64)
query_id:ca56d42f-82cd-478e-b8f0-0c87f3ba7429, fragment_instance:ca56d42f-82cd-478e-b8f0-0c87f3ba75ae, plan_node_id:83
*** Aborted at 1776300993 (unix time) try "date -d @1776300993" if you are using GNU date ***
PC: @ 0x8913170 starrocks::AdaptiveSliceHashSet::emplace(starrocks::MemPool*, starrocks::Slice)
*** SIGSEGV (@0x7f1797214000) received by PID 27 (TID 0x7f1855cf3640) LWP(658) from PID 18446744071950123008; stack trace: ***
@ 0x7f19089a0ee8 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x99ee7)
@ 0xffe6189 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f1909b4b706 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0x7f1909b4c17e JVM_handle_linux_signal
@ 0x7f1908949520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
@ 0x8913170 starrocks::AdaptiveSliceHashSet::emplace(starrocks::MemPool*, starrocks::Slice)
@ 0x8e87348 starrocks::NullableAggregateFunctionUnary<std::shared_ptr<starrocks::TDistinctAggregateFunction<(starrocks::LogicalType)13, (starrocks::LogicalType)13, starrocks::DistinctAggregateStateV2, (starrocks::AggDistinctType)0, starrocks::Slice> >, starrocks::Null?
@ 0x8765cc4 starrocks::Aggregator::compute_batch_agg_states(starrocks::Chunk*, unsigned long)
@ 0x95736f9 starrocks::pipeline::AggregateBlockingSinkOperator::push_chunk(starrocks::RuntimeState*, std::shared_ptr<starrocks::Chunk> const&)
@ 0x8261ed3 starrocks::pipeline::PipelineDriver::process(starrocks::RuntimeState*, int)
@ 0xbbb8416 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0xc0f245b starrocks::ThreadPool::dispatch_thread()
@ 0xc0e8c49 starrocks::Thread::supervise_thread(void*)
@ 0x7f190899bac3 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x94ac2)
@ 0x7f1908a2ca74 clone
[1776300993.546][thread: 139742495585856] je_mallctl execute purge success
[1776300993.546][thread: 139742495585856] je_mallctl execute dontdump success
start time: Thu Apr 16 08:56:54 CST 2026, server uptime: 08:56:54 up 205 days, 16:43, 0 users, load average: 28.92, 38.16, 38.25
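To line up the CN crash with the FE log, it helps to convert the Unix timestamp in the abort header into the same time zone as the FE timestamps. A minimal sketch in Python, assuming the cluster clocks are in China Standard Time (UTC+8), as the "CST" in the restart line suggests; the helper name `abort_time` is hypothetical:

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in the restart line means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def abort_time(line, tz=CST):
    """Extract the Unix time from a glog abort header and return it as an aware datetime."""
    m = re.search(r"Aborted at (\d+) \(unix time\)", line)
    if m is None:
        raise ValueError("no glog abort header found")
    return datetime.fromtimestamp(int(m.group(1)), tz=tz)


line = '*** Aborted at 1776300993 (unix time) try "date -d @1776300993" if you are using GNU date ***'
print(abort_time(line).isoformat())  # 2026-04-16T08:56:33+08:00
```

The result, 08:56:33, is about 20 seconds before the CN restart at 08:56:54. The FE excerpt in this report is stamped 08:48:15, roughly eight minutes earlier, so it may correspond to a different crash than the one shown in this stack trace.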
2. FE error log (Frontend Error):
Query ID mapping
2026-04-16 08:48:15.264+08:00 INFO (starrocks-mysql-nio-pool-28275|30221465) [MetadataMgr.lambda$static$0():137] Evict cache due to EXPIRED and deregister query-level connector metadata on query id: 408a8fb2-392d-11f1-8f3c-5eae4fd3a96f
2026-04-16 08:48:15.423+08:00 ERROR (thrift-server-pool-408111|30253803) [SRTThreadPoolServer$WorkerProcess.run():319] Thrift Error occurred during processing of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:174) ~[libthrift-0.20.0.jar:0.20.0]
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:100) ~[libthrift-0.20.0.jar:0.20.0]
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:519) ~[libthrift-0.20.0.jar:0.20.0]
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:387) ~[libthrift-0.20.0.jar:0.20.0]
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:271) ~[libthrift-0.20.0.jar:0.20.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) ~[libthrift-0.20.0.jar:0.20.0]
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.net.SocketException: Connection reset
at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:328) ~[?:?]
at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) ~[?:?]
at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) ~[?:?]
at java.net.Socket$SocketInputStream.read(Socket.java:966) ~[?:?]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:244) ~[?:?]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) ~[?:?]
at java.io.BufferedInputStream.read(BufferedInputStream.java:343) ~[?:?]
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:170) ~[libthrift-0.20.0.jar:0.20.0]
... 9 more
2026-04-16 08:48:15.423+08:00 ERROR (thrift-server-pool-408114|30253806) [SRTThreadPoolServer$WorkerProcess.run():319] Thrift Error occurred during processing of message.