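BE node crashed

To help us locate your problem faster, please provide the following information. Thank you.
[Business impact] BI query reports
[StarRocks version] 2.5.13
[Cluster size] 3 FE + 4 BE (FE and BE co-located)
[Machine info] 128 cores / 256 GB RAM / 10 GbE
[Contact]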

10003 10.1.100.90 9050 9060 7040 8060 2023-10-26 20:28:50 2023-11-04 11:08:07 true false false 8208 1.430 TB 43.116 TB 47.274 TB 8.80 % 10.12 % 2.5.13-a3b58a0 {"lastSuccessReportTabletsTime":"2023-11-04 11:07:53"} 44.546 TB 3.21 % 128 0 30.74 % 0.0 %
10004 10.1.100.91 9050 9060 7040 8060 2023-10-26 20:28:55 2023-11-04 11:08:07 true false false 8208 1.430 TB 43.129 TB 47.274 TB 8.77 % 9.57 % 2.5.13-a3b58a0 {"lastSuccessReportTabletsTime":"2023-11-04 11:07:36"} 44.559 TB 3.21 % 64 0 28.96 % 0.0 %
10005 10.1.100.93 9050 9060 7040 8060 2023-10-26 20:29:00 2023-11-04 11:08:07 true false false 8206 1.430 TB 43.110 TB 47.274 TB 8.81 % 10.14 % 2.5.13-a3b58a0 {"lastSuccessReportTabletsTime":"2023-11-04 11:07:33"} 44.540 TB 3.21 % 128 0 30.85 % 0.0 %
1273626 10.1.100.99 9050 9060 7040 8060 2023-10-26 20:29:00 2023-11-03 16:55:03 false false false 2 1.074 TB 43.668 TB 47.274 TB 7.63 % 8.37 % java.net.ConnectException: Connection refused (Connection refused) 2.5.13-a3b58a0 {"lastSuccessReportTabletsTime":"2023-11-03 16:54:14"} 44.742 TB 2.40 % 128 20 78.09 % 41.4 %

W1103 16:54:45.168536 18939 mem_hook.cpp:254] large memory alloc: 1443429680 bytes, stack:
@ 0x4ab551b malloc
@ 0x802b745 operator new()
@ 0x2d621ee std::vector<>::_M_range_insert<>()
@ 0x2d65664 starrocks::vectorized::BinaryColumnBase<>::append()
@ 0x50cb2fa starrocks::vectorized::NullableColumn::append()
@ 0x2e4ad63 starrocks::vectorized::JoinHashTable::append_chunk()
@ 0x3212a2e starrocks::vectorized::HashJoiner::append_chunk_to_ht()
@ 0x30c4249 starrocks::pipeline::HashJoinBuildOperator::push_chunk()
@ 0x2d90fee starrocks::pipeline::PipelineDriver::process()
@ 0x51add6a starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4b968f2 starrocks::ThreadPool::dispatch_thread()
@ 0x4b9138a starrocks::thread::supervise_thread()
@ 0x7fb6e26fbea5 start_thread
@ 0x7fb6e1d1696d __clone
@ (nil) (unknown)
W1103 16:54:47.087438 21050 runtime_filter_worker.cpp:272] brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=400ms @10.1.100.90:8060
W1103 16:54:52.495692 21050 runtime_filter_worker.cpp:272] brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=400ms @10.1.100.90:8060
W1103 16:54:52.497540 21050 runtime_filter_worker.cpp:272] brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=400ms @10.1.100.93:8060
W1103 16:54:59.387981 18973 mem_hook.cpp:254] large memory alloc: 1523232054 bytes, stack:
@ 0x4ab551b malloc
@ 0x802b745 operator new()
@ 0x2d621ee std::vector<>::_M_ran

Please post the crash stack from be.out.

be.out (1.7 KB)

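Since we upgraded to this version, the BE has crashed many times. The previous crash was also caused by a query exceeding the memory limit. Stability does not seem very good.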

Please post the output of dmesg -T. This is most likely an OOM caused by an unreasonable configuration.
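For anyone who wants to check this locally before uploading a large dmesg.log, here is a minimal sketch that filters saved `dmesg -T` output for OOM-killer events. The regular expression is an assumption about typical kernel wording ("invoked oom-killer", "Out of memory: Kill process" or "Killed process"), which varies between kernel versions; `find_oom_lines` is a hypothetical helper name, not part of StarRocks.

```python
import re

# Assumed wording of OOM-killer messages; adjust for your kernel version.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process",
    re.IGNORECASE,
)

def find_oom_lines(dmesg_text):
    """Return the lines of saved `dmesg -T` output that look like OOM events."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]
```

Usage would be something like `find_oom_lines(open("dmesg.log").read())` after saving the output with `dmesg -T > dmesg.log`.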

Please also post the output of cat /proc/sys/vm/overcommit_memory.
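As a reading aid for that value, the sketch below maps the contents of `/proc/sys/vm/overcommit_memory` to the three modes described in the Linux kernel documentation. The function names are hypothetical, and the sketch does not say which mode this cluster should use.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory according to the Linux kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit, never refuse an allocation",
    2: "strict accounting, refuse allocations beyond the commit limit",
}

def describe_overcommit(raw):
    """Turn the raw file contents (for example "1\\n") into (value, meaning)."""
    value = int(raw.strip())
    return value, OVERCOMMIT_MODES.get(value, "unknown mode")

def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read and describe the current setting on a Linux host."""
    return describe_overcommit(Path(path).read_text())
```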

dmesg.log (335.6 KB)


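The dmesg -T output shows that it was an OOM. Please provide the be.info log from before the crash so we can analyze why memory got out of control. Grep it for "Current memory".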

be-info.log (12.4 MB)

The StarRocks logs are too large, so uploading them is inconvenient every time. It would help if they could be split into segments.

It looks like my earlier reply did not go through: be-info.log (12.4 MB)

It is strange, though: only one BE in the cluster went down, yet BI queries now fail with an error. That should not happen. Why?
Exported log:
Failed node ID: [ 48b06869-67cd-53b5-50a6-6b35f5cfaa72 ]
java.util.concurrent.ExecutionException: com.finebi.common.exception.conf.table.FineSqlErrorException: Error code: 62400001 Backend not found. Check if any backend is down or not. backend: [10.1.100.99 alive: false inBlacklist: false] [10.1.100.90 alive: true inBlacklist: false] [10.1.100.91 alive: true inBlacklist: false] [10.1.100.93 alive: true inBlacklist: false]

I1103 16:54:57.829267 17253 daemon.cpp:188] Current memory statistics: process(189697741772), query_pool(157026921882), load(0), metadata(996139437), compaction(0), schema_change(0), column_pool(12855290879), page_cache(2376319520), update(0), chunk_allocator(1086285200), clone(0), consistency(0)

Are there any "large memory alloc" log entries around this point in time?
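To make that statistics line easier to read, here is a minimal sketch that splits a "Current memory statistics" entry into per-tracker byte counts and ranks them. The helper names are hypothetical, and the parsing assumes the `name(bytes)` layout seen in the line above. In that line, query_pool accounts for roughly 83% of the process total (157026921882 of 189697741772 bytes), which points at query memory rather than load or compaction.

```python
import re

STATS_RE = re.compile(r"Current memory statistics: (.*)")
TRACKER_RE = re.compile(r"(\w+)\((\d+)\)")

def parse_memory_stats(line):
    """Return {tracker_name: bytes} for one statistics line, or {} if it is not one."""
    match = STATS_RE.search(line)
    if not match:
        return {}
    return {name: int(value) for name, value in TRACKER_RE.findall(match.group(1))}

def top_trackers(stats, n=3):
    """Largest trackers other than the process total, as (name, GiB) pairs."""
    parts = [(name, size) for name, size in stats.items() if name != "process"]
    parts.sort(key=lambda item: item[1], reverse=True)
    return [(name, round(size / 2**30, 1)) for name, size in parts[:n]]

# The line posted in this thread.
line = (
    "I1103 16:54:57.829267 17253 daemon.cpp:188] Current memory statistics: "
    "process(189697741772), query_pool(157026921882), load(0), metadata(996139437), "
    "compaction(0), schema_change(0), column_pool(12855290879), page_cache(2376319520), "
    "update(0), chunk_allocator(1086285200), clone(0), consistency(0)"
)
stats = parse_memory_stats(line)
print(top_trackers(stats))
```

Running the same two functions over every matching line in be.INFO would give a time series of the trackers leading up to the crash.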

Specifically, allocations larger than 10 GB.

Please post the logs from I1103 16:49:40 to 16:55.
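Since the full log is hard to upload, a filter like the following could pull out only the "large memory alloc" warnings in that time window and above a size threshold. This is a sketch that assumes the glog-style prefix seen in the warnings posted earlier in this thread; `large_allocs` is a hypothetical helper. The two warnings quoted earlier are about 1.4 GB and 1.5 GB each, so neither passes a 10 GB threshold.

```python
import re
from datetime import time

# Assumed prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
ALLOC_RE = re.compile(
    r"^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.\d+ \d+ mem_hook\.cpp:\d+\] "
    r"large memory alloc: (\d+) bytes"
)

def large_allocs(lines, start, end, min_bytes=0):
    """Yield (time, bytes) for 'large memory alloc' warnings inside [start, end]."""
    for line in lines:
        match = ALLOC_RE.match(line)
        if not match:
            continue
        stamp = time(int(match.group(1)), int(match.group(2)), int(match.group(3)))
        size = int(match.group(4))
        if start <= stamp <= end and size >= min_bytes:
            yield stamp, size

# The two warnings quoted earlier in this thread.
sample = [
    "W1103 16:54:45.168536 18939 mem_hook.cpp:254] large memory alloc: 1443429680 bytes, stack:",
    "W1103 16:54:59.387981 18973 mem_hook.cpp:254] large memory alloc: 1523232054 bytes, stack:",
]
window = (time(16, 49, 40), time(16, 55, 0))
for stamp, size in large_allocs(sample, *window):
    print(stamp, size)
```

Passing `min_bytes=10 * 2**30` restricts the result to allocations of 10 GiB or more. To scan a whole log, pass an open file object as `lines`.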

From the logs, a large query was not contained by the memory limits. The cause needs to be analyzed.

info.log (1006.6 KB)

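But there is one problem: all the tables were created with replicas, so why do queries fail when just one BE node goes down? If that is the case, high availability cannot be guaranteed.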

This is not the expected behavior. It needs a closer look.