Resource isolation causes BE crash

To help us locate your issue faster, please provide the following information. Thanks.
[Details] With resource isolation configured, the BE node goes down as soon as one group exceeds its memory limit
[Business impact] BE crash
[StarRocks version] 2.1.2

Has anyone else hit this with resource groups: one group's query memory grows too large and the BE crashes outright. According to the documentation, shouldn't the query just fail with an error?
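
For context, a resource group with a memory cap is defined roughly like this (a minimal sketch; the group name, user, and limit values are hypothetical, not the poster's actual setup):

```sql
-- Minimal sketch of a resource group with a memory cap; the group name,
-- user, and values are hypothetical, for illustration only.
-- Per the docs, a query that pushes the group past 'mem_limit' should fail
-- with "Memory limit exceeded" rather than bring down the BE.
CREATE RESOURCE GROUP rg_example
TO (user='example_user')          -- classifier: route this user's queries into the group
WITH (
    'cpu_core_limit' = '4',
    'mem_limit' = '20%'           -- share of BE process memory this group may use
);
```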

2.1.2 is too old and no longer maintained; we recommend upgrading to the latest LTS 2.5 release.

Sorry, the version is actually 3.1.2.

Please provide the be.out log from when the BE crashed.

[Error] File: 3.1.2 RELEASE (build 4f3a2ee)
query_id:bf8d6335-95b4-11ee-bc93-fefcfe47a241, fragment_instance:bf8d6335-95b4-11ee-bc93-fefcfe47a244
tracker:process consumption: 16800861856
tracker:query_pool consumption: 2878779132
tracker:load consumption: 72608
tracker:metadata consumption: 332554216
tracker:tablet_metadata consumption: 25672117
tracker:rowset_metadata consumption: 88328830
tracker:segment_metadata consumption: 20404591
tracker:column_metadata consumption: 198148678
tracker:tablet_schema consumption: 3571797
tracker:segment_zonemap consumption: 13802657
tracker:short_key_index consumption: 1018659
tracker:column_zonemap_index consumption: 27521158
tracker:ordinal_index consumption: 96861472
tracker:bitmap_index consumption: 17600
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 1332808275
tracker:page_cache consumption: 10106869440
tracker:update consumption: 8762102
tracker:chunk_allocator consumption: 1729107128
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1702031408 (unix time) try "date -d @1702031408" if you are using GNU date ***
PC: @ 0x7630468 opentelemetry::v1::sdk::trace::Span::SetAttribute()
*** SIGSEGV (@0x0) received by PID 132276 (TID 0x7f9622ccc700) from PID 0; stack trace: ***
@ 0x6033302 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f96dc8451a2 os::Linux::chained_handler()
@ 0x7f96dc84b826 JVM_handle_linux_signal
@ 0x7f96dc841e13 signalHandler()
@ 0x7f96dbd13630 (unknown)
@ 0x7630468 opentelemetry::v1::sdk::trace::Span::SetAttribute()
@ 0x52ae0f5 starrocks::stream_load::OlapTableSink::close_wait()
@ 0x52aefc3 starrocks::stream_load::OlapTableSink::close()
@ 0x53364b8 starrocks::pipeline::OlapTableSinkOperator::pending_finish()
@ 0x2914db2 starrocks::pipeline::PipelineDriver::_check_fragment_is_canceled()
@ 0x29154e8 starrocks::pipeline::PipelineDriver::process()
@ 0x534d90e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4c4df02 starrocks::ThreadPool::dispatch_thread()
@ 0x4c489fa starrocks::thread::supervise_thread()
@ 0x7f96dbd0bea5 start_thread
@ 0x7f96db32696d __clone
@ 0x0 (unknown)

Found the corresponding entries in be.INFO:
W1208 18:30:08.448513 136140 pipeline_driver_executor.cpp:161] [Driver] Process error, query_id=c24d6ce6-95b4-11ee-bc93-fefcfe47a241, instance_id=c24d6ce6-95b4-11ee-bc93-fefcfe47a243, status=Memory limit exceeded: Memory of ecdb_role exceed limit. read and decompress page Used: 3679353017, Limit: 3678390094. Mem usage has exceed the limit of the resource group [ecdb_role]. You can change the limit by modifying [mem_limit] of this group
/build/starrocks/be/src/storage/rowset/page_io.cpp:135 CurrentThread::mem_tracker()->check_mem_limit("read and decompress page")
/build/starrocks/be/src/storage/rowset/scalar_column_iterator.cpp:292 _reader->read_page(_opts, iter.page(), &handle, &page_body, &footer)
/build/starrocks/be/src/storage/rowset/scalar_column_iterator.cpp:252 _read_data_page(_page_iter)
/build/starrocks/be/src/storage/rowset/scalar_column_iterator.cpp:201 _load_next_page(&eos)
/build/starrocks/be/src/storage/rowset/segment_iterator.cpp:134 _column_iterators[i]->next_batch(range, col.get())
/build/starrocks/be/src/storage/rowset/segment_iterator.cpp:931 _context->read_columns(chunk, range)
/build/starrocks/be/src/storage/rowset/segment_iterator.cpp:1003 _read(chunk, rowid, chunk_capacity - chunk_start)
/build/starrocks/be/src/storage/tablet_reader.cpp:232 _collect_iter->get_next(chunk)
/build/starrocks/be/src/exec/pipeline/scan/olap_chunk_source.cpp:369 _prj_iter->get_next(chunk)
/build/starrocks/be/src/exec/pipeline/scan/scan_operator.cpp:226 _get_scan_status()
W1208 18:30:08.440775 136145 pipeline_driver_executor.cpp:161] [Driver] Process error, query_id=bf8d6335-95b4-11ee-bc93-fefcfe47a241, instance_id=bf8d6335-95b4-11ee-bc93-fefcfe47a243, status=Memory limit exceeded: Memory of ecdb_role exceed limit. Pipeline Backend: 172.16.0.22, fragment: bf8d6335-95b4-11ee-bc93-fefcfe47a243 Used: 3679841506, Limit: 3678390094. Mem usage has exceed the limit of the resource group [ecdb_role]. You can change the limit by modifying [mem_limit] of this group
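
Side note: the mem_limit named in this error can be adjusted in place if the cap itself is too tight (a sketch; the new value is purely illustrative):

```sql
-- Raise the memory cap of the ecdb_role group in place (30% is illustrative).
ALTER RESOURCE GROUP ecdb_role WITH ('mem_limit' = '30%');
```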

How are your resource groups configured?

The ecdb_role group has 2 users in it.
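
To see exactly how the group and its user classifiers are defined, the standard statement is (output columns vary by version):

```sql
-- List every resource group together with its classifiers (user bindings, etc.).
SHOW RESOURCE GROUPS ALL;
```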

Does this SQL crash the BE every time it runs? Could you share it?

The table only has a few million rows, which isn't much. The main issue is that exhausting a resource group's memory should not crash the BE; if it can, resource isolation is effectively useless.
SELECT
    s.*,
    m.fina_warehouse_name
FROM
    report_my_order_logistics_statistics s
    INNER JOIN my_supplier_warehouse_mapping m ON s.fk_warehouse_code = m.ck_code
WHERE
    is_lanshou_report = 1
    AND is_copy = 0

Run cat /proc/sys/vm/overcommit_memory; it should be set to 1 (vm.overcommit_memory=1).
If FE and BE are co-deployed and memory is tight, remember to lower the BE's mem_limit so the FE has enough memory reserved.
Also check dmesg -T | grep oom for OOM-killer messages.
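
As a cross-check from the SQL side, the BE process memory limit can be read from information_schema (assuming the be_configs table shipped with 3.x; availability and columns vary by version):

```sql
-- Read the process-level memory limit each BE runs with.
-- Assumes information_schema.be_configs is available (present in 3.x).
SELECT * FROM information_schema.be_configs WHERE name = 'mem_limit';
```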

The configuration all looks normal.
(three configuration screenshots attached)

Are FE and BE co-deployed? What are the server specs?

FE and BE are co-deployed: 3 servers, 12 cores / 48 GB RAM each, with 1 FE and 1 BE per server.

We suspect a bug. Please provide the be.INFO log from the crashed BE.

The log is too large; I can send it over WeCom. What's your WeCom ID? I'll add you.