BE crash on production StarRocks 2.5.5

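[Details] be.out shows the concat_ws function in the crash stack, and my queries also use concat_ws. The data volume is small (all dimension tables), and the query runs once every 10 minutes.
[Background] The BE went down at around 12:13 today. I have checked the list of common crashes and found no matching case.
[Business impact]
[StarRocks version] 2.5.5
[Cluster size] 3 FE + 4 BE (FE and BE co-located)
[Machine info] 48C/250G/10GbE
[Contact] Community group 3 - 高显
[Attachments]
Query SQL: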
truncate table ods.tmp_ent_port_tlns_frqc;
insert into ods.tmp_ent_port_tlns_frqc
SELECT
cast(substr(t.STRT_CD,4) as int) as dbct_cd,
cast(substr(t.DEST_CD,4) as int) as nxt_site_cd,
MAX(nullif(concat_ws(':',ONE_FRQC_ADJ_LST_SND_CAR_TM,'00'),'00')) one_frqc_snd_tm, -- frequency 1 dispatch time
MAX(nullif(concat_ws(':',ONE_FRQC_FLD_TM,'00'),'00')) one_frqc_oper_dur, -- frequency 1 site-clearing duration
MAX(nullif(concat_ws(':',ONE_FRQC_TLNS_END_PT,'00'),'00')) one_frqc_tlns_end_pt, -- frequency 1 service-time cutoff
MAX(nullif(concat_ws(':',TWO_FRQC_ADJ_LST_SND_CAR_TM,'00'),'00')) two_frqc_snd_tm, -- frequency 2 dispatch time
MAX(nullif(concat_ws(':',TWO_FRQC_FLD_TM,'00'),'00')) two_frqc_oper_dur, -- frequency 2 site-clearing duration
MAX(nullif(concat_ws(':',TWO_FRQC_TLNS_END_PT,'00'),'00')) two_frqc_tlns_end_pt, -- frequency 2 service-time cutoff
t.vld_tm,t.end_tm as ivld_tm
FROM dim.dim_dbct_lst_snd_car_schd t
where end_tm>${start_12d}
GROUP by t.STRT_CD, t.DEST_CD, t.vld_tm,t.ivld_tm;

-- Dispatch time per frequency, from distribution center to branch
truncate table ods.tmp_d2b_frqc_snd_tm_uniq;
insert into ods.tmp_d2b_frqc_snd_tm_uniq
SELECT dbct_cd -- distribution center code
,brch_cd -- branch code
,nullif(concat_ws(':',substring(fld_frqc_cd_one,1,2),substring(fld_frqc_cd_one,3,2),'00'),'00') AS one_frqc_fld_tm -- frequency 1 dispatch time
,nullif(concat_ws(':',substring(fld_frqc_cd_two,1,2),substring(fld_frqc_cd_two,3,2),'00'),'00') AS two_frqc_fld_tm -- frequency 2 dispatch time
,vld_tm -- effective time
,ivld_tm -- expiry time
FROM dim.dim_brch_cfm_rcv_maint
where ivld_tm>${start_12d};

BE.OUT
start time: Tue May 16 15:48:26 CST 2023
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/d/p2/app/StarRocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/d/p2/app/StarRocks/be/lib/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[warn] evbuffer_file_segment_materialize: mmap(2605, 0, 74) failed: No such device
2.5.5 RELEASE (build 24c1eca)
query_id:5a5a0b28-f9e9-11ed-9712-ecebb89ecd60, fragment_instance:5a5a0b28-f9e9-11ed-9712-ecebb89ecd61
tracker:process consumption: 85996248320
tracker:query_pool consumption: 942624
tracker:load consumption: 406368
tracker:metadata consumption: 4725757074
tracker:tablet_metadata consumption: 671472229
tracker:rowset_metadata consumption: 424329661
tracker:segment_metadata consumption: 1221427886
tracker:column_metadata consumption: 2408527298
tracker:tablet_schema consumption: 277925
tracker:segment_zonemap consumption: 1168807165
tracker:short_key_index consumption: 11496096
tracker:column_zonemap_index consumption: 1933431242
tracker:ordinal_index consumption: -844947048
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 9501057479
tracker:page_cache consumption: 52739893520
tracker:update consumption: 3913064668
tracker:chunk_allocator consumption: 2159307672
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1684901615 (unix time) try "date -d @1684901615" if you are using GNU date ***
PC: @ 0x2b073adf2d69 __memcpy_ssse3_back
*** SIGSEGV (@0x1000000f0) received by PID 141580 (TID 0x2b0a1fbfe700) from PID 240; stack trace: ***
@ 0x58f9dc2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x2b0739ca1852 os::Linux::chained_handler()
@ 0x2b0739ca8676 JVM_handle_linux_signal
@ 0x2b0739c9e653 signalHandler()
@ 0x2b073a3845d0 (unknown)
@ 0x2b073adf2d69 __memcpy_ssse3_back
@ 0x5009670 starrocks::vectorized::concat_ws_small()
@ 0x500ccc1 starrocks::vectorized::StringFunctions::concat_ws()
@ 0x3e32797 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
@ 0x3df45a8 starrocks::vectorized::VectorizedNullIfExpr<>::evaluate()
@ 0x381a06e starrocks::ExprContext::evaluate()
@ 0x2f92002 starrocks::pipeline::ProjectOperator::push_chunk()
@ 0x2cf7836 starrocks::pipeline::PipelineDriver::process()
@ 0x4f2ef23 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4924bf2 starrocks::ThreadPool::dispatch_thread()
@ 0x491f6ea starrocks::thread::supervise_thread()
@ 0x2b073a37cdd5 start_thread
@ 0x2b073ad9cead __clone
@ 0x0 (unknown)
start time: Wed May 24 13:14:30 CST 2023
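
The "Aborted at" value in the trace is a Unix epoch, so it can be checked against the reported outage time. A minimal sketch, assuming GNU date is available (as the log line itself suggests); the function name `epoch_to_cst` is made up for illustration, and `TZ=CST-8` is a POSIX TZ string for a zone 8 hours east of UTC.

```shell
# Convert a glog "Aborted at <epoch>" value to China Standard Time (UTC+8).
# Hypothetical helper; relies on GNU date's "-d @<epoch>" syntax.
epoch_to_cst() {
    TZ=CST-8 date -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Epoch taken from the first crash in be.out.
epoch_to_cst 1684901615   # prints 2023-05-24 12:13:35
```

This lines up with the reported outage at around 12:13 and with the BE restart logged at 13:14:30 on May 24.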

Does this SQL reproduce the crash reliably every time it runs?

I ran the SQL on its own dozens of times and the BE did not go down.

Adjust this parameter: echo 60000 > /proc/sys/vm/max_map_count

[warn] evbuffer_file_segment_materialize: mmap(2605, 0, 74) failed: No such device
I suspect this warning may be the cause.

The current value of this parameter is 65530. Should I change it to 60000?


I will change concat_ws to concat, let it run for a while, and see what happens.

echo 200000 > /proc/sys/vm/max_map_count

OK, I will try adjusting it.

I do not have permission to run that command. I ran sudo echo "vm.max_map_count = 200000" >> /etc/sysctl.conf instead. That should be equivalent, right?

Editing the file alone has no effect; you need to run sysctl -p for it to take effect.

I ran sysctl -p as well. I will keep an eye on it.
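
One caveat on the exchange above: in `sudo echo ... >> /etc/sysctl.conf` the redirection is opened by the calling shell rather than by sudo, so it only works if that shell can already write the file, and repeated `>>` appends leave duplicate lines behind. A minimal sketch of an idempotent alternative follows; the helper name `set_sysctl_line` is hypothetical, and it takes the file path as a parameter so it can be tried on a scratch file first.

```shell
# Hypothetical helper: set "KEY = VALUE" in a sysctl-style file, replacing any
# existing assignment of KEY instead of appending a duplicate line.
# Note: KEY is used as a regex, so dots in it match any character (fine for a sketch).
set_sysctl_line() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    [ -f "$file" ] || : > "$file"
    # Keep every line that is not an assignment of KEY (grep exits 1 if none remain).
    grep -v -E "^[[:space:]]*${key}[[:space:]]*=" "$file" > "$tmp" || true
    # Append the new assignment, then write the result back in place.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, this would be `set_sysctl_line vm.max_map_count 200000 /etc/sysctl.conf` followed by `sysctl -p`. The value actually in effect can be read back with `cat /proc/sys/vm/max_map_count`.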

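After the adjustment, another machine went down at around 10:30 this morning. The data volume was also twice that of yesterday's run.
I changed the SQL. This is a small part of the code being executed: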
truncate table ods.tmp_ent_port_tlns_frqc;
insert into ods.tmp_ent_port_tlns_frqc
SELECT
cast(substr(t.STRT_CD,4) as int) as dbct_cd,
cast(substr(t.DEST_CD,4) as int) as nxt_site_cd,
MAX(case when ONE_FRQC_ADJ_LST_SND_CAR_TM is not null then concat(ONE_FRQC_ADJ_LST_SND_CAR_TM,':','00') end) one_frqc_snd_tm,
MAX(case when ONE_FRQC_FLD_TM is not null then concat(ONE_FRQC_FLD_TM,':','00') end) one_frqc_oper_dur,
MAX(case when ONE_FRQC_TLNS_END_PT is not null then concat(ONE_FRQC_TLNS_END_PT,':','00') end) one_frqc_tlns_end_pt,
MAX(case when TWO_FRQC_ADJ_LST_SND_CAR_TM is not null then concat(TWO_FRQC_ADJ_LST_SND_CAR_TM,':','00') end) two_frqc_snd_tm,
MAX(case when TWO_FRQC_FLD_TM is not null then concat(TWO_FRQC_FLD_TM,':','00') end) two_frqc_oper_dur,
MAX(case when TWO_FRQC_TLNS_END_PT is not null then concat(TWO_FRQC_TLNS_END_PT,':','00') end) two_frqc_tlns_end_pt,
t.vld_tm,t.end_tm as ivld_tm
FROM dim.dim_dbct_lst_snd_car_schd t -- this table is a Greenplum external table

BE.OUT :
start time: Tue May 16 15:36:24 CST 2023
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/d/p2/app/StarRocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/d/p2/app/StarRocks/be/lib/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[warn] evbuffer_file_segment_materialize: mmap(4167, 0, 74) failed: No such device
2.5.5 RELEASE (build 24c1eca)
query_id:3c42d765-faa3-11ed-9712-ecebb89ecd60, fragment_instance:3c42d765-faa3-11ed-9712-ecebb89ecd61
tracker:process consumption: 158607687276
tracker:query_pool consumption: 67686453743
tracker:load consumption: 40336
tracker:metadata consumption: 5105087579
tracker:tablet_metadata consumption: 669229292
tracker:rowset_metadata consumption: 426347655
tracker:segment_metadata consumption: 1329814882
tracker:column_metadata consumption: 2679695750
tracker:tablet_schema consumption: 267884
tracker:segment_zonemap consumption: 1272609617
tracker:short_key_index consumption: 12517666
tracker:column_zonemap_index consumption: 2105348790
tracker:ordinal_index consumption: -866499616
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 6622752505
tracker:page_cache consumption: 52464753152
tracker:update consumption: 5610415267
tracker:chunk_allocator consumption: 1955993536
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1684981454 (unix time) try "date -d @1684981454" if you are using GNU date ***
PC: @ 0x2b407321d74c __memcpy_ssse3_back
*** SIGSEGV (@0x2b472f9ffff0) received by PID 35967 (TID 0x2b4237b05700) from PID 799014896; stack trace: ***
@ 0x58f9dc2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x2b40720ca852 os::Linux::chained_handler()
@ 0x2b40720d1676 JVM_handle_linux_signal
@ 0x2b40720c7653 signalHandler()
@ 0x2b40727ad5d0 (unknown)
@ 0x2b407321d74c __memcpy_ssse3_back
@ 0x4e9d058 starrocks::stream_load::OlapTableSink::_print_varchar_error_msg()
@ 0x4e9fb09 starrocks::stream_load::OlapTableSink::_validate_data()
@ 0x4eac9d3 starrocks::stream_load::OlapTableSink::send_chunk()
@ 0x4f1fd99 starrocks::pipeline::OlapTableSinkOperator::push_chunk()
@ 0x2cf7836 starrocks::pipeline::PipelineDriver::process()
@ 0x4f2ef23 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4924bf2 starrocks::ThreadPool::dispatch_thread()
@ 0x491f6ea starrocks::thread::supervise_thread()
@ 0x2b40727a5dd5 start_thread
@ 0x2b40731c5ead __clone
@ 0x0 (unknown)
start time: Thu May 25 10:57:06 CST 2023
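
The two traces in this thread both fault inside __memcpy_ssse3_back, but under different callers: concat_ws_small() in the first and OlapTableSink::_print_varchar_error_msg() in the second. To compare crashes like this across several be.out files, the first StarRocks frame under each SIGSEGV line can be extracted. A minimal sketch, assuming the stack format shown in the logs above; the function name `first_crash_frame` is made up for illustration.

```shell
# Hypothetical helper: for each SIGSEGV stack trace in a be.out file, print the
# first frame that belongs to the starrocks:: namespace (the innermost
# StarRocks caller), without its address.
first_crash_frame() {
    awk '
        /\*\*\* SIGSEGV/ { armed = 1; next }    # a new stack trace starts here
        armed && /starrocks::/ {
            # Strip the leading "@ 0x<address> " so only the symbol remains.
            sub(/^[ \t]*@[ \t]*0x[0-9a-f]+[ \t]+/, "")
            print
            armed = 0                           # ignore the outer frames
        }
    ' "$1"
}
```

Usage would be `first_crash_frame be.out`, which prints one symbol per crash recorded in the file.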

Does this SQL read from an external table or write to one?

It reads the external table dim.dim_brch_cfm_rcv_maint and writes into a table in StarRocks. I have just changed vm.max_map_count to 655360 and will keep observing.

Then it is probably a different cause; we will look into it further. Can you capture a core file?

How do I capture one? Please send me the commands.

@U_1653974266322_1581
ulimit -c unlimited

/data/StarRocks/be/bin/stop_be.sh

/data/StarRocks/be/bin/start_be.sh --daemon

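Then run the SQL to trigger the bug.
Next, look for the core file in the directory where you ran /data/StarRocks/be/bin/start_be.sh --daemon.
Package the core file:
tar czvf core.xxx.tar.gz core.xxx
It is usually quite large, so it is best to upload it to COS or a cloud drive and share the link.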
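
Before waiting hours for the next crash, it may be worth confirming that a core file can actually be written and where it will land. A minimal sketch, assuming a Linux host; both function names are hypothetical. The core size limit is inherited by processes started from the same shell, and /proc/sys/kernel/core_pattern decides the file name and location (a value starting with "|" means cores are piped to a helper program instead of being written to the working directory).

```shell
# Hypothetical helper: print the settings that decide whether and where a
# core file is written for processes started from this shell.
show_core_settings() {
    echo "core size limit: $(ulimit -c)"
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

# Hypothetical helper: list files named core* in a directory, newest first.
list_cores() {
    ls -1t "${1:-.}"/core* 2>/dev/null
}
```

Run `show_core_settings` in the shell right before start_be.sh, and `list_cores <dir>` on the directory BE was started from after the next crash.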

Adjusting vm.max_map_count did not help; another machine just went down. The SQL runs from a shell script every ten minutes, and a machine goes down every few hours.

I will capture the core file tomorrow. The interval is not fixed, but this SQL brings a BE down once every few hours.