BE node crashed unexpectedly

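【Details】The BE node crashed.
【Background】The cluster had been running normally for a month, then the BE suddenly crashed. The FE is still healthy.
【Business impact】
【StarRocks version】2.0.4
【Cluster size】1 machine (FE and BE co-located)
【Machine info】4 cores / 8 GB (FE = 2 GB, BE = 6 GB)
【Attachments】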

  • fe.warn.log

  • be.warn.log (dmesg -T shows no "Out of memory: Kill process" entries)

I0602 18:28:05.104640 59470 internal_service.cpp:241] exec plan fragment, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6, coord=TNetworkAddress(hostname=172.18.0.1, port=9020), backend=1 is_pipeline 0
I0602 18:28:05.104677 59470 plan_fragment_executor.cpp:70] Prepare(): query_id=b01a2802-e25e-11ec-8736-0242250eaf97 fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6 backend_num=1
I0602 18:28:05.105054 59323 plan_fragment_executor.cpp:205] Open(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6

E0602 18:28:05.881072 59257 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.880582 59252 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.906898 59250 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
W0602 18:28:05.935935 59364 plan_fragment_executor.cpp:210] fail to open fragment, instance_id=b01a2802-e25e-11ec-8736-0242250eafa5, status=Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.939838 59276 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
I0602 18:28:06.002171 59470 internal_service.cpp:284] cancel framgent, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eaf98, reason: InternalError
I0602 18:28:06.002192 59470 plan_fragment_executor.cpp:389] cancel(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eaf98
I0602 18:28:06.002172 59472 internal_service.cpp:284] cancel framgent, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6, reason: InternalError
I0602 18:28:06.002200 59472 plan_fragment_executor.cpp:389] cancel(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6
W0602 18:28:06.002228 59364 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eafa5: Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
W0602 18:28:06.002770 59323 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eafa6: Cancelled: Cancelled SenderQueue::get_chunk
W0602 18:28:06.002836 59311 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eaf98: Cancelled: Cancelled SenderQueue::get_chunk
I0602 18:28:06.002926 59084 tablet_sink.cpp:1103] Exiting consumer thread, no running channel
I0602 18:28:06.003011 59472 load_channel_mgr.cpp:301] Cancelled load channel load id=b01a2802e25e11ec-87360242250eaf97
I0602 18:28:06.003018 59472 load_channel.cpp:40] load channel mem peak usage=0, info=limit: 1603517368; consumption: 0; label: b01a2802e25e11ec-87360242250eaf97; all tracker size: 3; limit trackers size: 3; parent is null: false; , load_id=b01a2802e25e11ec-87360242250eaf97
I0602 18:28:06.003093 59323 plan_fragment_executor.cpp:471] Fragment b01a2802-e25e-11ec-8736-0242250eafa6:(Active: 897.233ms, non-child: 0.00%)
AverageThreadTokens: 1.00
MemoryLimit: 2.00 GB
PeakMemoryUsage: 0
RowsProduced: 0
DataStreamSender (dst_id=45, dst_fragments=[b01a2802e25e11ec-87360242250eaf98]):(Active: 232.105us, non-child: 0.03%)
PartType: RANDOM
BytesSent: 0
CompressTime: 0.000ns
IgnoreRows: 0
OverallThroughput: 0.00 /sec
SendRequestTime: 27.226us
SerializeBatchTime: 0.000ns
ShuffleDispatchTime: 0.000ns
ShuffleHashTime: 0.000ns
UncompressedBytes: 0
WaitResponseTime: 157.673us
PROJECT_NODE (id=44):(Active: 897.332ms, non-child: 0.01%)
CommonSubExprComputeTime: 0.000ns
ExprComputeTime: 0.000ns
PeakMemoryUsage: 0
RowsReturned: 0
RowsReturnedRate: 0
AGGREGATION_NODE (id=43):(Active: 897.247ms, non-child: 0.00%)
AggregateFunctions: max(161: max), min(162: min), count(157: count), sum(158: sum), approx_count_distinct(159: approx_count_distinct), count(160: count)
AggComputeTime: 0.000ns
ExprComputeTime: 0.000ns
ExprReleaseTime: 0.000ns
GetResultsTime: 0.000ns
HashTableSize: 0
InputRowCount: 0
PassThroughRowCount: 0
PeakMemoryUsage: 0
ResultAggAppendTime: 0.000ns
ResultGroupByAppendTime: 0.000ns
ResultIteratorTime: 0.000ns
RowsReturned: 0
RowsReturnedRate: 0
StreamingTime: 0.000ns
EXCHANGE_NODE (id=42):(Active: 897.223ms, non-child: 100.00%)
BytesReceived: 0
DecompressRowBatchTimer: 0.000ns
DeserializeRowBatchTimer: 0.000ns
PeakMemoryUsage: 0
RequestReceived: 0
RowsReturned: 0
RowsReturnedRate: 0
SenderTotalTime: 0.000ns
SenderWaitLockTime: 0.000ns
I0602 18:28:06.003255 59311 plan_fragment_executor.cpp:471] Fragment b01a2802-e25e-11ec-8736-0242250eaf98:(Active: 898.104ms, non-child: 0.00%)
AverageThreadTokens: 1.00
MemoryLimit: 2.00 GB
PeakMemoryUsage: 22.62 KB
RowsProduced: 6
OlapTableSink:(Active: 411.956us, non-child: 0.05%)
CloseWaitTime: 0.000ns
ConvertBatchTime: 0.000ns
NonBlockingSendTime: 0.000ns
OpenTime: 314.027us
RowsFiltered: 0
RowsRead: 0
RowsReturned: 0
SendDataTime: 0.000ns
SerializeBatchTime: 0.000ns
ValidateDataTime: 0.000ns
PROJECT_NODE (id=46):(Active: 898.110ms, non-child: 0.02%)
CommonSubExprComputeTime: 388.000ns
ExprComputeTime: 23.987us
PeakMemoryUsage: 0
RowsReturned: 6
RowsReturnedRate: 6.00 /sec
UNION_NODE (id=0):(Active: 897.966ms, non-child: 0.01%)
PeakMemoryUsage: 0
RowsReturned: 6
RowsReturnedRate: 6.00 /sec
EXCHANGE_NODE (id=13):(Active: 5.520ms, non-child: 0.61%)
BytesReceived: 171.00 B
DecompressRowBatchTimer: 1.703us
DeserializeRowBatchTimer: 12.204us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 181.00 /sec
SenderTotalTime: 20.896us
SenderWaitLockTime: 179.000ns
EXCHANGE_NODE (id=20):(Active: 1.651us, non-child: 0.00%)
BytesReceived: 162.00 B
DecompressRowBatchTimer: 1.194us
DeserializeRowBatchTimer: 11.131us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 605.69 K/sec
SenderTotalTime: 16.009us
SenderWaitLockTime: 130.000ns
EXCHANGE_NODE (id=26):(Active: 1.335us, non-child: 0.00%)
BytesReceived: 155.00 B
DecompressRowBatchTimer: 701.000ns
DeserializeRowBatchTimer: 8.803us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 749.06 K/sec
SenderTotalTime: 12.049us
SenderWaitLockTime: 112.000ns
EXCHANGE_NODE (id=32):(Active: 795.000ns, non-child: 0.00%)
BytesReceived: 163.00 B
DecompressRowBatchTimer: 853.000ns
DeserializeRowBatchTimer: 15.771us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 1.26 M/sec
SenderTotalTime: 20.173us
SenderWaitLockTime: 165.000ns
EXCHANGE_NODE (id=38):(Active: 1.672us, non-child: 0.00%)
BytesReceived: 170.00 B
DecompressRowBatchTimer: 663.000ns
DeserializeRowBatchTimer: 9.566us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 598.09 K/sec
SenderTotalTime: 13.123us
SenderWaitLockTime: 87.000ns
EXCHANGE_NODE (id=45):(Active: 123.760ms, non-child: 13.78%)
BytesReceived: 0
DecompressRowBatchTimer: 0.000ns
DeserializeRowBatchTimer: 0.000ns
PeakMemoryUsage: 0
RequestReceived: 0
RowsReturned: 0
RowsReturnedRate: 0
SenderTotalTime: 0.000ns
SenderWaitLockTime: 0.000ns
EXCHANGE_NODE (id=6):(Active: 768.610ms, non-child: 85.58%)
BytesReceived: 185.00 B
DecompressRowBatchTimer: 1.603us
DeserializeRowBatchTimer: 17.464us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 1.00 /sec
SenderTotalTime: 26.295us
SenderWaitLockTime: 176.000ns
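
For a longer be.warn.log it can help to count how often the allocation failure appears and which source file reports it. The following is a minimal sketch, not a StarRocks tool: the function name `summarize_allocation_failures` and the regular expression are illustrative assumptions based only on the line layout visible in the excerpt above (level letter, MMDD, time, thread id, file:line, message).

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpt above:
# <level><MMDD> <HH:MM:SS.us> <thread id> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<src>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def summarize_allocation_failures(lines):
    """Count lines mentioning ERROR_allocation_failed, keyed by (level, source file)."""
    counts = Counter()
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match and "ERROR_allocation_failed" in match.group("msg"):
            counts[(match.group("level"), match.group("src"))] += 1
    return counts


# Three lines copied from the log above (one error, one warning, one unrelated info line).
sample = [
    "E0602 18:28:05.881072 59257 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed",
    "W0602 18:28:05.935935 59364 plan_fragment_executor.cpp:210] fail to open fragment, instance_id=b01a2802-e25e-11ec-8736-0242250eafa5, status=Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed",
    "I0602 18:28:06.002926 59084 tablet_sink.cpp:1103] Exiting consumer thread, no running channel",
]

counts = summarize_allocation_failures(sample)
for (level, src), n in sorted(counts.items()):
    print(level, src, n)
```

To run it against a real file, pass an open file object instead of `sample`, for example `summarize_allocation_failures(open("be.warn.log"))`.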

  • Slow query:
    • Profile info
    • Parallelism: 2
    • CBO enabled: yes
    • BE node CPU and memory usage (screenshot)
             total   used   free   shared   buff/cache   available
      Mem:    7551   6278    160        0         1112        1016
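
To put the memory snapshot above in proportion, here is a small sketch that turns the "Mem:" row into percentages. It assumes the six numbers follow the usual `free` column order (total, used, free, shared, buff/cache, available) and are in the same unit; the helper name `parse_free_mem` is made up for illustration.

```python
def parse_free_mem(line):
    """Parse a 'Mem:' row from `free` output into a dict of integers."""
    names = ["total", "used", "free", "shared", "buff_cache", "available"]
    fields = line.split()
    if fields[0] != "Mem:" or len(fields) != len(names) + 1:
        raise ValueError(f"unexpected line: {line!r}")
    return dict(zip(names, map(int, fields[1:])))


# The row reported in the post.
mem = parse_free_mem("Mem: 7551 6278 160 0 1112 1016")

used_pct = 100 * mem["used"] / mem["total"]
available_pct = 100 * mem["available"] / mem["total"]
print(f"used: {used_pct:.1f}%  available: {available_pct:.1f}%")
```

With these numbers the machine has roughly 83% of its memory in use and about 13% still available, which is consistent with an allocation failing under load even though the kernel OOM killer never fired.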

Judging from the log above, could the crash have been caused by running out of memory while computing statistics?

Hello, could you share the be.out log from around the time of the crash? I would like to trace the stack.

terminate called recursively
terminate called after throwing an instance of 'terminate called recursively
*** Aborted at 1654165685 (unix time) try "date -d @1654165685" if you are using GNU date ***
std::bad_alloc'
what(): std::bad_alloc
PC: @ 0x7f23fae4e387 __GI_raise
*** SIGABRT (@0x3ea0000e71e) received by PID 59166 (TID 0x7f23debc9700) from PID 59166; stack trace: ***
@ 0x33b8242 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f23fbb19630 (unknown)
@ 0x7f23fae4e387 __GI_raise
@ 0x7f23fae4fa78 __GI_abort
@ 0x4d02122 __gnu_cxx::__verbose_terminate_handler()
@ 0x4d00b96 __cxxabiv1::__terminate()
@ 0x4d00c01 std::terminate()
@ 0x4d00d54 __cxa_throw
@ 0x1582298 _Znwm.cold
@ 0x170f61e std::vector<>::_M_range_insert<>()
@ 0x170acd5 starrocks::vectorized::BinaryColumn::append_continuous_strings()
@ 0x17030fe starrocks::vectorized::NullableColumn::append_continuous_strings()
@ 0x1b04775 starrocks::segment_v2::BinaryPlainPageDecoder<>::next_batch()
@ 0x1abed77 starrocks::segment_v2::ParsedPageV2::read()
@ 0x2cbc0a8 starrocks::segment_v2::ScalarColumnIterator::next_batch()
@ 0x2c24cf1 starrocks::vectorized::SegmentIterator::_do_get_next()
@ 0x2c28521 starrocks::vectorized::SegmentIterator::do_get_next()
@ 0x1b14eaa starrocks::SegmentIteratorWrapper::do_get_next()
@ 0x189b37b starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x18d122a starrocks::vectorized::TabletReader::do_get_next()
@ 0x2369804 starrocks::vectorized::TabletScanner::get_chunk()
@ 0x20eb24b starrocks::vectorized::OlapScanNode::_scanner_thread()
@ 0x1b3ccad starrocks::PriorityThreadPool::work_thread()
@ 0x3359a87 thread_proxy
@ 0x7f23fbb11ea5 start_thread
@ 0x7f23faf16b0d __clone
@ 0x0 (unknown)
terminate called recursively
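
The "Aborted at 1654165685 (unix time)" line in be.out can be matched against the be.warn.log timestamps. The sketch below converts it to local time under one assumption that the thread does not state: that the server clock runs at UTC+8.

```python
from datetime import datetime, timedelta, timezone

ABORT_EPOCH = 1654165685  # from the "*** Aborted at ... (unix time)" line in be.out
ASSUMED_TZ = timezone(timedelta(hours=8))  # assumption: the server clock is UTC+8

abort_local = datetime.fromtimestamp(ABORT_EPOCH, tz=ASSUMED_TZ)
abort_str = abort_local.strftime("%Y-%m-%d %H:%M:%S")
print(abort_str)
```

Under that assumption the abort falls at 18:28:05 on June 2, the same second as the first "Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed" errors in be.warn.log, so the std::bad_alloc stack belongs to the same incident.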

Is this the one?

Have you found out what the problem is?

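Sorry for the late reply. We traced this to what is probably the global dictionary optimization. As a workaround, first run set global cbo_enable_low_cardinality_optimize = false; If you are also hitting OOM and the FE and BE are co-located, set mem_limit in be.conf to (machine memory minus the memory reserved for the FE). A fix already exists and will be merged into version 2.0.8, so we recommend upgrading to 2.0.8 (an LTS release).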
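
The mem_limit rule above (machine memory minus the memory reserved for the FE) is simple arithmetic. The sketch below applies it to the 8 GB machine with 2 GB for the FE described in the original post. The helper name `be_mem_limit` is hypothetical, and the calculation ignores memory used by the operating system and other processes.

```python
def be_mem_limit(machine_gb, fe_reserved_gb):
    """Return (BE memory in GB, same value as a whole percentage of machine memory)."""
    if not 0 <= fe_reserved_gb < machine_gb:
        raise ValueError("FE reservation must be non-negative and smaller than machine memory")
    be_gb = machine_gb - fe_reserved_gb
    return be_gb, round(100 * be_gb / machine_gb)


# Figures from the original post: 8 GB machine, 2 GB reserved for the FE.
be_gb, pct = be_mem_limit(8, 2)
print(f"BE share: {be_gb} GB, i.e. {pct}% of machine memory")
```

For this machine the rule gives 6 GB, or 75% of total memory.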

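Thank you very much. mem_limit was already set to 75%, so the 8 GB of memory is split as FE = 2 GB and BE = 6 GB.
One question: if the global dictionary is the cause, when is it actually used? My tables have no bitmap or Bloom filter (BF) indexes.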

Has a release date for 2.0.8 been set?