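【Details】BE node crashed
【Background】The cluster had been running normally for a month when the BE suddenly went down; the FE is still running normally.
【Business impact】
【StarRocks version】2.0.4
【Cluster size】1 machine (FE and BE co-located)
【Machine info】4 cores / 8 GB (FE = 2 GB, BE = 6 GB)
【Attachments】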
- fe.warn.log
- be.warn.log (dmesg -T shows no "Out of memory: Kill process" entry)
I0602 18:28:05.104640 59470 internal_service.cpp:241] exec plan fragment, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6, coord=TNetworkAddress(hostname=172.18.0.1, port=9020), backend=1 is_pipeline 0
I0602 18:28:05.104677 59470 plan_fragment_executor.cpp:70] Prepare(): query_id=b01a2802-e25e-11ec-8736-0242250eaf97 fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6 backend_num=1
I0602 18:28:05.105054 59323 plan_fragment_executor.cpp:205] Open(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6
E0602 18:28:05.881072 59257 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.880582 59252 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.906898 59250 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
W0602 18:28:05.935935 59364 plan_fragment_executor.cpp:210] fail to open fragment, instance_id=b01a2802-e25e-11ec-8736-0242250eafa5, status=Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
E0602 18:28:05.939838 59276 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
I0602 18:28:06.002171 59470 internal_service.cpp:284] cancel framgent, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eaf98, reason: InternalError
I0602 18:28:06.002192 59470 plan_fragment_executor.cpp:389] cancel(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eaf98
I0602 18:28:06.002172 59472 internal_service.cpp:284] cancel framgent, fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6, reason: InternalError
I0602 18:28:06.002200 59472 plan_fragment_executor.cpp:389] cancel(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6
W0602 18:28:06.002228 59364 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eafa5: Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed
W0602 18:28:06.002770 59323 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eafa6: Cancelled: Cancelled SenderQueue::get_chunk
W0602 18:28:06.002836 59311 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eaf98: Cancelled: Cancelled SenderQueue::get_chunk
I0602 18:28:06.002926 59084 tablet_sink.cpp:1103] Exiting consumer thread, no running channel
I0602 18:28:06.003011 59472 load_channel_mgr.cpp:301] Cancelled load channel load id=b01a2802e25e11ec-87360242250eaf97
I0602 18:28:06.003018 59472 load_channel.cpp:40] load channel mem peak usage=0, info=limit: 1603517368; consumption: 0; label: b01a2802e25e11ec-87360242250eaf97; all tracker size: 3; limit trackers size: 3; parent is null: false; , load_id=b01a2802e25e11ec-87360242250eaf97
I0602 18:28:06.003093 59323 plan_fragment_executor.cpp:471] Fragment b01a2802-e25e-11ec-8736-0242250eafa6:(Active: 897.233ms, non-child: 0.00%)
AverageThreadTokens: 1.00
MemoryLimit: 2.00 GB
PeakMemoryUsage: 0
RowsProduced: 0
DataStreamSender (dst_id=45, dst_fragments=[b01a2802e25e11ec-87360242250eaf98]):(Active: 232.105us, non-child: 0.03%)
PartType: RANDOM
BytesSent: 0
CompressTime: 0.000ns
IgnoreRows: 0
OverallThroughput: 0.00 /sec
SendRequestTime: 27.226us
SerializeBatchTime: 0.000ns
ShuffleDispatchTime: 0.000ns
ShuffleHashTime: 0.000ns
UncompressedBytes: 0
WaitResponseTime: 157.673us
PROJECT_NODE (id=44):(Active: 897.332ms, non-child: 0.01%)
CommonSubExprComputeTime: 0.000ns
ExprComputeTime: 0.000ns
PeakMemoryUsage: 0
RowsReturned: 0
RowsReturnedRate: 0
AGGREGATION_NODE (id=43):(Active: 897.247ms, non-child: 0.00%)
AggregateFunctions: max(161: max), min(162: min), count(157: count), sum(158: sum), approx_count_distinct(159: approx_count_distinct), count(160: count)
AggComputeTime: 0.000ns
ExprComputeTime: 0.000ns
ExprReleaseTime: 0.000ns
GetResultsTime: 0.000ns
HashTableSize: 0
InputRowCount: 0
PassThroughRowCount: 0
PeakMemoryUsage: 0
ResultAggAppendTime: 0.000ns
ResultGroupByAppendTime: 0.000ns
ResultIteratorTime: 0.000ns
RowsReturned: 0
RowsReturnedRate: 0
StreamingTime: 0.000ns
EXCHANGE_NODE (id=42):(Active: 897.223ms, non-child: 100.00%)
BytesReceived: 0
DecompressRowBatchTimer: 0.000ns
DeserializeRowBatchTimer: 0.000ns
PeakMemoryUsage: 0
RequestReceived: 0
RowsReturned: 0
RowsReturnedRate: 0
SenderTotalTime: 0.000ns
SenderWaitLockTime: 0.000ns
I0602 18:28:06.003255 59311 plan_fragment_executor.cpp:471] Fragment b01a2802-e25e-11ec-8736-0242250eaf98:(Active: 898.104ms, non-child: 0.00%)
AverageThreadTokens: 1.00
MemoryLimit: 2.00 GB
PeakMemoryUsage: 22.62 KB
RowsProduced: 6
OlapTableSink:(Active: 411.956us, non-child: 0.05%)
CloseWaitTime: 0.000ns
ConvertBatchTime: 0.000ns
NonBlockingSendTime: 0.000ns
OpenTime: 314.027us
RowsFiltered: 0
RowsRead: 0
RowsReturned: 0
SendDataTime: 0.000ns
SerializeBatchTime: 0.000ns
ValidateDataTime: 0.000ns
PROJECT_NODE (id=46):(Active: 898.110ms, non-child: 0.02%)
CommonSubExprComputeTime: 388.000ns
ExprComputeTime: 23.987us
PeakMemoryUsage: 0
RowsReturned: 6
RowsReturnedRate: 6.00 /sec
UNION_NODE (id=0):(Active: 897.966ms, non-child: 0.01%)
PeakMemoryUsage: 0
RowsReturned: 6
RowsReturnedRate: 6.00 /sec
EXCHANGE_NODE (id=13):(Active: 5.520ms, non-child: 0.61%)
BytesReceived: 171.00 B
DecompressRowBatchTimer: 1.703us
DeserializeRowBatchTimer: 12.204us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 181.00 /sec
SenderTotalTime: 20.896us
SenderWaitLockTime: 179.000ns
EXCHANGE_NODE (id=20):(Active: 1.651us, non-child: 0.00%)
BytesReceived: 162.00 B
DecompressRowBatchTimer: 1.194us
DeserializeRowBatchTimer: 11.131us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 605.69 K/sec
SenderTotalTime: 16.009us
SenderWaitLockTime: 130.000ns
EXCHANGE_NODE (id=26):(Active: 1.335us, non-child: 0.00%)
BytesReceived: 155.00 B
DecompressRowBatchTimer: 701.000ns
DeserializeRowBatchTimer: 8.803us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 749.06 K/sec
SenderTotalTime: 12.049us
SenderWaitLockTime: 112.000ns
EXCHANGE_NODE (id=32):(Active: 795.000ns, non-child: 0.00%)
BytesReceived: 163.00 B
DecompressRowBatchTimer: 853.000ns
DeserializeRowBatchTimer: 15.771us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 1.26 M/sec
SenderTotalTime: 20.173us
SenderWaitLockTime: 165.000ns
EXCHANGE_NODE (id=38):(Active: 1.672us, non-child: 0.00%)
BytesReceived: 170.00 B
DecompressRowBatchTimer: 663.000ns
DeserializeRowBatchTimer: 9.566us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 598.09 K/sec
SenderTotalTime: 13.123us
SenderWaitLockTime: 87.000ns
EXCHANGE_NODE (id=45):(Active: 123.760ms, non-child: 13.78%)
BytesReceived: 0
DecompressRowBatchTimer: 0.000ns
DeserializeRowBatchTimer: 0.000ns
PeakMemoryUsage: 0
RequestReceived: 0
RowsReturned: 0
RowsReturnedRate: 0
SenderTotalTime: 0.000ns
SenderWaitLockTime: 0.000ns
EXCHANGE_NODE (id=6):(Active: 768.610ms, non-child: 85.58%)
BytesReceived: 185.00 B
DecompressRowBatchTimer: 1.603us
DeserializeRowBatchTimer: 17.464us
PeakMemoryUsage: 0
RequestReceived: 1.00 B
RowsReturned: 1
RowsReturnedRate: 1.00 /sec
SenderTotalTime: 26.295us
SenderWaitLockTime: 176.000ns
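To see at a glance which warnings and errors dominate a log like the one above, the lines can be grouped by severity and source location. The following is only a rough sketch: it assumes the usual glog line layout (level letter, MMDD date, time, thread id, file:line, message), and the function name `summarize_problems` is made up for illustration and is not part of StarRocks.

```python
import re
from collections import Counter

# Assumed glog layout: <level><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+:\d+)\] (?P<msg>.*)$"
)


def summarize_problems(lines):
    """Count WARNING/ERROR/FATAL lines, grouped by (level, source location).

    Lines that do not match the glog layout (for example the indented
    profile counters) are skipped.
    """
    counts = Counter()
    for line in lines:
        m = GLOG_LINE.match(line.strip())
        if m and m.group("level") in ("W", "E", "F"):
            counts[(m.group("level"), m.group("src"))] += 1
    return counts


# A few lines copied from the log above, used as sample input.
sample = [
    "I0602 18:28:05.105054 59323 plan_fragment_executor.cpp:205] Open(): fragment_instance_id=b01a2802-e25e-11ec-8736-0242250eafa6",
    "E0602 18:28:05.881072 59257 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed",
    "E0602 18:28:05.880582 59252 olap_scan_node.cpp:255] [TUniqueId(hi=-5757245180786896404, lo=-8703766746734088297)] Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed",
    "W0602 18:28:06.002228 59364 fragment_mgr.cpp:194] Fail to open fragment b01a2802-e25e-11ec-8736-0242250eafa5: Invalid argument: Fail to do LZ4FRAME decompress, res=ERROR_allocation_failed",
    "PeakMemoryUsage: 0",
]

summary = summarize_problems(sample)
for (level, src), n in summary.most_common():
    print(level, src, n)
```

For a full log, the same function can be fed an open file handle, e.g. `summarize_problems(open("be.warn.log"))`, where the path is whatever the attachment is saved as locally.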
- Slow query:
- Profile information
- Parallelism: 2
- CBO enabled: yes
- Screenshot of BE node CPU and memory usage
              total        used        free      shared  buff/cache   available
Mem:           7551        6278         160           0        1112        1016
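For a quick read of how much headroom that leaves, the figures can be turned into percentages. This is a minimal sketch that assumes the numbers come from `free -m` (so the unit is MiB); `parse_free_mem` is a hypothetical helper written for this post, not an existing tool.

```python
def parse_free_mem(header: str, row: str) -> dict:
    """Pair the column names of `free` output with the values on the Mem: row."""
    names = header.split()
    values = [int(v) for v in row.split()[1:]]  # drop the leading "Mem:" label
    return dict(zip(names, values))


# Values copied from the output above; assumed to be MiB (free -m).
mem = parse_free_mem(
    "total used free shared buff/cache available",
    "Mem: 7551 6278 160 0 1112 1016",
)

used_pct = 100 * mem["used"] / mem["total"]
available_pct = 100 * mem["available"] / mem["total"]
print(f"used: {used_pct:.1f}%  available: {available_pct:.1f}%")
```

On these numbers roughly 83% of memory is in use and about 13% is still reported as available, which fits the ERROR_allocation_failed messages in the log even though dmesg shows no OOM kill.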