Queries cause widespread BE crashes

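[Background] Querying one day of data (about 6 GB) from a Hive external table crashes nearly all BE nodes in the cluster, every time.
[StarRocks version] 3.3.0-rc01
[Cluster size] 70 BEs
[Machine info] Kubernetes cluster; each BE has 10 cores / 30 GB memory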


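This post describes a situation similar to ours. On a physical-machine cluster running 3.1.11, the same query does not cause this problem.
Is this an unknown bug, or does a Kubernetes cluster need some special configuration? Could you take a look? @yuchen1019 @许秀不许秀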

be.out output from a BE node:
query_id:ff9b157c-2403-11ef-b507-9a86be1a17f9, fragment_instance:ff9b157c-2403-11ef-b507-9a86be1a1857
Hive file path: 000001_0.gz, partition id: 0, length: 67108864, offset: 134217728
tracker:process consumption: 160997336
tracker:query_pool consumption: 34608064
tracker:query_pool/connector_scan consumption: 541753344
tracker:load consumption: 0
tracker:metadata consumption: 2152
tracker:tablet_metadata consumption: 2152
tracker:rowset_metadata consumption: 0
tracker:segment_metadata consumption: 0
tracker:column_metadata consumption: 0
tracker:tablet_schema consumption: 784
tracker:segment_zonemap consumption: 0
tracker:short_key_index consumption: 0
tracker:column_zonemap_index consumption: 0
tracker:ordinal_index consumption: 0
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 0
tracker:jit_cache consumption: 0
tracker:update consumption: 0
tracker:chunk_allocator consumption: 0
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0

*** Aborted at 1717678507 (unix time) try “date -d @1717678507” if you are using GNU date ***
PC: @ 0x6fa6714 starrocks::CSVReader::buff_capacity()
*** SIGSEGV (@0x98) received by PID 128 (TID 0x7f1faf12b640) from PID 152; stack trace: ***
@ 0x9854bba google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f204629e520 (unknown)
@ 0x6fa6714 starrocks::CSVReader::buff_capacity()
@ 0x6fa0896 starrocks::HdfsTextScanner::estimated_mem_usage()
@ 0x72d5872 starrocks::pipeline::ConnectorChunkSource::close()
@ 0x5255034 starrocks::pipeline::ScanOperator::_close_chunk_source_unlocked()
@ 0x5253d4f starrocks::pipeline::ScanOperator::_finish_chunk_source_task()
@ 0x5258fde _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlRT_E_clINS_9workgroup12YieldContextEEEDaS5_.constprop.0
@ 0x543a72b starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x82d2d7c starrocks::ThreadPool::dispatch_thread()
@ 0x82cc22a starrocks::thread::supervise_thread()
@ 0x7f20462f0ac3 (unknown)
@ 0x7f2046381a04 clone
@ 0x0 (unknown)

Attachment: Explain (7.1 KB)

This may be the issue: https://github.com/StarRocks/starrocks/pull/46372

Try upgrading to 3.3.0-rc02.

@trueeyu It still crashes after the upgrade, with the same stack in the dump: starrocks::CSVReader::buff_capacity()

I've sent you a private message; let's exchange contact details.


@丹尼尔007 @trueeyu Thanks for the help. I'm posting the fix link here: https://github.com/StarRocks/starrocks/pull/46830

Has this issue been fixed in version 2.5?

I've hit another issue that crashes the BE. Could you take a look? @丹尼尔007
*** Aborted at 1718856824 (unix time) try “date -d @1718856824” if you are using GNU date ***
PC: @ 0x661f6eb starrocks::parquet::ScalarColumnReader::fill_dst_column()
*** SIGSEGV (@0x0) received by PID 138 (TID 0x7f65204f4640) from PID 0; stack trace: ***
@ 0x6ed86b2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f65d37f8256 os::Linux::chained_handler()
@ 0x7f65d37fdf4b JVM_handle_linux_signal
@ 0x7f65d37f0a8c signalHandler()
@ 0x7f65d290a520 (unknown)
@ 0x661f6eb starrocks::parquet::ScalarColumnReader::fill_dst_column()
@ 0x6618478 starrocks::parquet::GroupReader::_fill_dst_chunk()
@ 0x6618c0d starrocks::parquet::GroupReader::get_next()
@ 0x65ed849 starrocks::parquet::FileReader::get_next()
@ 0x6440cdc starrocks::HdfsParquetScanner::do_get_next()
@ 0x6431b35 starrocks::HdfsScanner::get_next()
@ 0x63c4917 starrocks::connector::HiveDataSource::get_next()
@ 0x3cc8f22 starrocks::pipeline::ConnectorChunkSource::_read_chunk()
@ 0x3fecd0f starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
@ 0x3cb9219 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlRT_E_clINS_9workgroup12YieldContextEEEDaS5_.constprop.0
@ 0x3dc8c8e starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x330891c starrocks::ThreadPool::dispatch_thread()
@ 0x33025aa starrocks::thread::supervise_thread()
@ 0x7f65d295cac3 (unknown)
@ 0x7f65d29eda04 clone
@ 0x0 (unknown)


This was fixed here.