2.3.7超过mem_limit限制BE节点宕机

【详述】问题详细描述
【背景】做过哪些操作?
【业务影响】
【StarRocks版本】例如:2.3.7
【集群规模】例如:3fe(1 follower+2observer)+5be(fe与be混部)
【机器信息】CPU虚拟核/内存/网卡,例如:16C/31G/万兆
【联系方式】
【附件】be.out
tcmalloc: large alloc 4294967296 bytes == 0x442978000 @ 0x59042ef 0x5b9595c 0x5b95cbe 0x211ad5e 0x210b93d 0x20b711d 0x2068df7 0x25c8be8 0x2c727df 0x2c72a26 0x2b7117c 0x29dbfde 0x2974658 0x296616c 0x295bfb6 0x21c59f9 0x21c15aa 0x7fc01b221dd5
query_id:766f7202-be8b-11ed-8aa0-00163e145fc0, fragment_instance:766f7202-be8b-11ed-8aa0-00163e145fc6
*** Aborted at 1678374185 (unix time) try “date -d @1678374185” if you are using GNU date ***
PC: @ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
*** SIGSEGV (@0x54eccc000) received by PID 12868 (TID 0x7fbf91af0700) from PID 1322041344; stack trace: ***
@ 0x40e1c82 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc01b2295d0 (unknown)
@ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
@ 0x2c72a26 starrocks::vectorized::TDistinctAggregateFunction<>::merge()
@ 0x2b7117c starrocks::vectorized::NullableAggregateFunctionBase<>::merge_batch_single_state()
@ 0x29dbfde starrocks::Aggregator::compute_single_agg_state()
@ 0x2974658 starrocks::pipeline::AggregateBlockingSinkOperator::push_chunk()
@ 0x296616c starrocks::pipeline::PipelineDriver::process()
@ 0x295bfb6 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x21c59f9 starrocks::ThreadPool::dispatch_thread()
@ 0x21c15aa starrocks::thread::supervise_thread()
@ 0x7fc01b221dd5 start_thread
@ 0x7fc01a83d02d __clone
@ 0x0 (unknown)

BUG 复现:
1、SQL:

SQL

SELECT COUNT(DISTINCT ((dm_ec_cashloan_risk_diff_sence_monitor.p_date >= DATE(‘2022-12-01’)) AND (dm_ec_cashloan_risk_diff_sence_monitor.p_date <= DATE(‘2023-03-08’)))) AS ctd_Calculation_250583147382595594_ok,
COUNT(DISTINCT (CASE WHEN (dm_ec_cashloan_risk_diff_sence_monitor.af_user_type = ‘47’) THEN dm_ec_cashloan_risk_diff_sence_monitor.af_user_type WHEN (dm_ec_cashloan_risk_diff_sence_monitor.last_state_name = ‘SCORE_AFPIEMPTY_CLIK_RECALL’) THEN dm_ec_cashloan_risk_diff_sence_monitor.af_user_type ELSE ‘正常’ END)) AS ctd_Calculation_542402341543010329_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.af_user_type) AS ctd_af_user_type_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.afpi) AS ctd_afpi_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.all_pass_num) AS ctd_all_pass_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.bf_user_type) AS ctd_bf_user_type_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.canrepply_cnt) AS ctd_canrepply_cnt_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.done_num) AS ctd_done_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.is_risk_platform) AS ctd_is_risk_platform_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.last_state_name) AS ctd_last_state_name (复制)_323133334705078295_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.last_state_name) AS ctd_last_state_name_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.main_rank_level) AS ctd_main_rank_level_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.main_rank_name) AS ctd_main_rank_name_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.media_source) AS ctd_media_source_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.operation_level) AS ctd_operation_level_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.order_seq) AS ctd_order_seq_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.pass_num) AS ctd_pass_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.risk_flow_id) AS ctd_risk_flow_id_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.risk_platform_type_name) AS ctd_risk_platform_type_name_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.second_canrepply_cnt) AS ctd_second_canrepply_cnt_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.second_done_num) AS ctd_second_done_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.second_pass_num) AS ctd_second_pass_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.term_time_period) AS ctd_term_time_period_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.terms) AS ctd_terms_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.tixian_1_sorce_num) AS ctd_tixian_1_sorce_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.tixian_7_sorce_num) AS ctd_tixian_7_sorce_num_ok,
COUNT(DISTINCT dm_ec_cashloan_risk_diff_sence_monitor.user_num) AS ctd_user_num_ok
FROM ec_report.dm_ec_cashloan_risk_diff_sence_monitor
HAVING (COUNT(1) > 0);

2、BE 配置:
set global cbo_enable_low_cardinality_optimize = true ;
exec_mem_limit=2G
3、执行SQL
第一次执行:

日志

第二次执行
BE 挂2台

日志

tcmalloc: large alloc 4294967296 bytes == 0x442978000 @ 0x59042ef 0x5b9595c 0x5b95cbe 0x211ad5e 0x210b93d 0x20b711d 0x2068df7 0x25c8be8 0x2c727df 0x2c72a26 0x2b7117c 0x29dbfde 0x2974658 0x296616c 0x295bfb6 0x21c59f9 0x21c15aa 0x7fc01b221dd5
query_id:766f7202-be8b-11ed-8aa0-00163e145fc0, fragment_instance:766f7202-be8b-11ed-8aa0-00163e145fc6
*** Aborted at 1678374185 (unix time) try “date -d @1678374185” if you are using GNU date ***
PC: @ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
*** SIGSEGV (@0x54eccc000) received by PID 12868 (TID 0x7fbf91af0700) from PID 1322041344; stack trace: ***
@ 0x40e1c82 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc01b2295d0 (unknown)
@ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
@ 0x2c72a26 starrocks::vectorized::TDistinctAggregateFunction<>::merge()
@ 0x2b7117c starrocks::vectorized::NullableAggregateFunctionBase<>::merge_batch_single_state()
@ 0x29dbfde starrocks::Aggregator::compute_single_agg_state()
@ 0x2974658 starrocks::pipeline::AggregateBlockingSinkOperator::push_chunk()
@ 0x296616c starrocks::pipeline::PipelineDriver::process()
@ 0x295bfb6 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x21c59f9 starrocks::ThreadPool::dispatch_thread()
@ 0x21c15aa starrocks::thread::supervise_thread()
@ 0x7fc01b221dd5 start_thread
@ 0x7fc01a83d02d __clone
@ 0x0 (unknown)

麻烦把be.out完整文件附件贴下,另外dmesg -T看下有没有oom的信息

exec_mem_limit=2G是单个instance的内存上限,一个sql会有多个示例。所以sql里的调用会比这个大,如果是混布的话需要在be.conf里设置mem_limit里防止oom,可以发下be.out我们看下宕机的堆栈么

tcmalloc: large alloc 4294967296 bytes == 0x442978000 @ 0x59042ef 0x5b9595c 0x5b95cbe 0x211ad5e 0x210b93d 0x20b711d 0x2068df7 0x25c8be8 0x2c727df 0x2c72a26 0x2b7117c 0x29dbfde 0x2974658 0x296616c 0x295bfb6 0x21c59f9 0x21c15aa 0x7fc01b221dd5
query_id:766f7202-be8b-11ed-8aa0-00163e145fc0, fragment_instance:766f7202-be8b-11ed-8aa0-00163e145fc6
*** Aborted at 1678374185 (unix time) try “date -d @1678374185” if you are using GNU date ***
PC: @ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
*** SIGSEGV (@0x54eccc000) received by PID 12868 (TID 0x7fbf91af0700) from PID 1322041344; stack trace: ***
@ 0x40e1c82 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc01b2295d0 (unknown)
@ 0x2c728c0 starrocks::vectorized::DistinctAggregateState<>::deserialize_and_merge()
@ 0x2c72a26 starrocks::vectorized::TDistinctAggregateFunction<>::merge()
@ 0x2b7117c starrocks::vectorized::NullableAggregateFunctionBase<>::merge_batch_single_state()
@ 0x29dbfde starrocks::Aggregator::compute_single_agg_state()
@ 0x2974658 starrocks::pipeline::AggregateBlockingSinkOperator::push_chunk()
@ 0x296616c starrocks::pipeline::PipelineDriver::process()
@ 0x295bfb6 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x21c59f9 starrocks::ThreadPool::dispatch_thread()
@ 0x21c15aa starrocks::thread::supervise_thread()
@ 0x7fc01b221dd5 start_thread
@ 0x7fc01a83d02d __clone
@ 0x0 (unknown)