BE nodes go down after a Stream Load import

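[Details]
While using Stream Load to import table data in JSON format, one 4.4 GB file after another, the BE nodes became unreachable during the third file.
Running a SQL query then fails with the following error: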

SQL 错误 [1064] [42000]: Backend not found. Check if any backend is down or not

The show backends command reports the following BE node status:

BackendId|IP           |HeartbeatPort|BePort|HttpPort|BrpcPort|LastStartTime      |LastHeartbeat      |Alive|SystemDecommissioned|ClusterDecommissioned|TabletNum|DataUsedCapacity|AvailCapacity|TotalCapacity|UsedPct|MaxDiskUsedPct|ErrMsg                                              |Version      |Status                                                |DataTotalCapacity|DataUsedPct|CpuCores|
---------|-------------|-------------|------|--------|--------|-------------------|-------------------|-----|--------------------|---------------------|---------|----------------|-------------|-------------|-------|--------------|----------------------------------------------------|-------------|------------------------------------------------------|-----------------|-----------|--------|
10029    |10.0.0.1     |9350         |9360  |8340    |8360    |2022-11-16 15:13:24|2022-11-17 09:36:08|false|false               |false                |85       |1.355 GB        |569.232 GB   |784.392 GB   |27.43 %|27.43 %       |java.net.ConnectException: 拒绝连接 (Connection refused)|2.4.1-cc3c302|{"lastSuccessReportTabletsTime":"2022-11-17 09:35:26"}|570.588 GB       |0.24 %     |48      |
10034    |10.0.0.3     |9350         |9360  |8340    |8360    |2022-11-16 15:13:54|2022-11-17 09:36:13|false|false               |false                |85       |1.356 GB        |606.692 GB   |784.392 GB   |22.65 %|22.65 %       |java.net.ConnectException: 拒绝连接 (Connection refused)|2.4.1-cc3c302|{"lastSuccessReportTabletsTime":"2022-11-17 09:35:56"}|608.048 GB       |0.22 %     |48      |
10033    |10.0.0.2     |9350         |9360  |8340    |8360    |2022-11-16 15:13:39|2022-11-17 09:36:13|false|false               |false                |85       |1.354 GB        |525.872 GB   |784.392 GB   |32.96 %|32.96 %       |java.net.SocketTimeoutException: Read timed out     |2.4.1-cc3c302|{"lastSuccessReportTabletsTime":"2022-11-17 09:35:41"}|527.226 GB       |0.26 %     |48      |

The be.WARNING file on the BE node contains the following:

W1117 09:34:08.612994 36388 mem_hook.cpp:255] large memory alloc: 2147483648 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7fe74892aea5  start_thread
    @     0x7fe747f45b0d  __clone
    @              (nil)  (unknown)
W1117 09:34:19.651793 36388 mem_hook.cpp:255] large memory alloc: 4294967296 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7fe74892aea5  start_thread
    @     0x7fe747f45b0d  __clone
    @              (nil)  (unknown)
W1117 09:34:44.107417 36388 mem_hook.cpp:255] large memory alloc: 8589934592 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7fe74892aea5  start_thread
    @     0x7fe747f45b0d  __clone
    @              (nil)  (unknown)
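
The three warnings above request 2147483648, 4294967296 and 8589934592 bytes, each double the previous one, and all come from StreamLoadAction::on_chunk_data(). That pattern looks like a single receive buffer growing by doubling while the request body arrives, though nothing in this thread confirms it. As a minimal sketch, the helper below (alloc_sizes_gib is a made-up name, not a StarRocks tool) pulls the sizes out of a be.WARNING file and prints them in GiB:

```shell
# Hypothetical helper: read be.WARNING text on stdin and print the size of
# every "large memory alloc" warning in GiB, one per line.
alloc_sizes_gib() {
  grep -o 'large memory alloc: [0-9]*' |
    awk '{ printf "%.1f GiB\n", $4 / (1024 * 1024 * 1024) }'
}

# Usage (the log path is an assumption; adjust it to your deployment):
#   alloc_sizes_gib < be/log/be.WARNING
```

For the excerpt above this yields 2.0 GiB, 4.0 GiB and 8.0 GiB, which is a lot to ask for on hosts with 48 GB of memory shared between FE and BE.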

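Is this because a single 4.4 GB import file is too large? The same operation did not cause this problem on version 2.3.3. Did 2.4.1 lower the maximum file size supported for import?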

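[Import/export method] stream load
[StarRocks version] 2.4.1
[Cluster size] 2 FE (1 follower + 1 observer) + 3 BE (FE and BE co-located)
[Machine info] virtual CPU cores / memory / NIC: 48C / 48G / gigabit
[Table model] primary key model
[Import or export method] stream load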

Please post the be.out file. Also run dmesg -T to check whether the BE was killed by the OOM killer.
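
When the kernel OOM killer terminates a process, it normally leaves lines such as "Out of memory" or "invoked oom-killer" in the kernel log. A minimal sketch of that check follows; oom_events is a hypothetical helper name, and the patterns are common kernel wordings rather than an exhaustive list:

```shell
# Hypothetical helper: keep only kernel log lines that look like OOM-killer
# activity. Reads the log on stdin; prints nothing if there is no match.
oom_events() {
  grep -i -E 'out of memory|oom-killer|killed process' || true
}

# Usage on a BE host (dmesg may need root on some systems):
#   dmesg -T | oom_events
```

Empty output suggests the kernel did not kill the process, which would point toward a failure inside the process itself, such as an allocation request being refused.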

This is the last part of the dmesg -T output. It only has entries from the 14th, nothing from the 17th.

[一 11月 14 16:38:10 2022] vmw_vmci 0000:00:07.7: irq 66 for MSI/MSI-X
[一 11月 14 16:38:10 2022] Guest personality initialized and is active
[一 11月 14 16:38:10 2022] VMCI host device registered (name=vmci, major=10, minor=58)
[一 11月 14 16:38:10 2022] Initialized host personality
[一 11月 14 16:38:10 2022] sd 2:0:0:0: Attached scsi generic sg0 type 0
[一 11月 14 16:38:10 2022] sd 2:0:1:0: Attached scsi generic sg1 type 0
[一 11月 14 16:38:10 2022] sr 1:0:0:0: Attached scsi generic sg2 type 5
[一 11月 14 16:38:10 2022] input: PC Speaker as /devices/platform/pcspkr/input/input4
[一 11月 14 16:38:10 2022] ppdev: user-space parallel port driver
[一 11月 14 16:38:10 2022] Adding 15626236k swap on /dev/mapper/centos-swap. Priority:-1 extents:1 across:15626236k FS
[一 11月 14 16:38:10 2022] cryptd: max_cpu_qlen set to 100
[一 11月 14 16:38:10 2022] AVX2 version of gcm_enc/dec engaged.
[一 11月 14 16:38:10 2022] AES CTR mode by8 optimization enabled
[一 11月 14 16:38:10 2022] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
[一 11月 14 16:38:10 2022] alg: No test for __generic-gcm-aes-aesni (__driver-generic-gcm-aes-aesni)
[一 11月 14 16:38:10 2022] XFS (sda1): Mounting V5 Filesystem
[一 11月 14 16:38:31 2022] XFS (sda1): Starting recovery (logdev: internal)
[一 11月 14 16:38:31 2022] XFS (sda1): Ending recovery (logdev: internal)
[一 11月 14 16:38:31 2022] RPC: Registered named UNIX socket transport module.
[一 11月 14 16:38:31 2022] RPC: Registered udp transport module.
[一 11月 14 16:38:31 2022] RPC: Registered tcp transport module.
[一 11月 14 16:38:31 2022] RPC: Registered tcp NFSv4.1 backchannel transport module.
[一 11月 14 16:38:31 2022] type=1305 audit(1668415113.323:3): audit_pid=1176 old=0 auid=4294967295 ses=4294967295 res=1
[一 11月 14 16:38:33 2022] NET: Registered protocol family 40
[一 11月 14 16:38:33 2022] IPv6: ADDRCONF(NETDEV_UP): ens160: link is not ready
[一 11月 14 16:38:33 2022] vmxnet3 0000:03:00.0 ens160: intr type 3, mode 0, 9 vectors allocated
[一 11月 14 16:38:33 2022] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
[一 11月 14 16:38:35 2022] ip6_tables: © 2000-2006 Netfilter Core Team
[一 11月 14 16:38:35 2022] Ebtables v2.0 registered
[一 11月 14 16:38:35 2022] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[一 11月 14 16:38:35 2022] tun: Universal TUN/TAP device driver, 1.6
[一 11月 14 16:38:35 2022] tun: © 1999-2004 Max Krasnyansky maxk@qualcomm.com
[一 11月 14 16:38:35 2022] virbr0: port 1(virbr0-nic) entered blocking state
[一 11月 14 16:38:35 2022] virbr0: port 1(virbr0-nic) entered disabled state
[一 11月 14 16:38:35 2022] device virbr0-nic entered promiscuous mode
[一 11月 14 16:38:35 2022] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[一 11月 14 16:38:35 2022] virbr0: port 1(virbr0-nic) entered blocking state
[一 11月 14 16:38:35 2022] virbr0: port 1(virbr0-nic) entered listening state
[一 11月 14 16:38:35 2022] IPv6: ADDRCONF(NETDEV_UP): virbr0: link is not ready
[一 11月 14 16:38:36 2022] virbr0: port 1(virbr0-nic) entered disabled state

This is the content of be.out:

start time: 2022年 11月 16日 星期三 15:13:19 CST
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:70490f38-4a82-3e6e-1a5b-9c31a60fca80, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1668648971 (unix time) try "date -d @1668648971" if you are using GNU date ***
PC: @     0x7fe747e7d387 __GI_raise
*** SIGABRT (@0x3ed00008b41) received by PID 35649 (TID 0x7fe6e7efc700) from PID 35649; stack trace: ***
    @          0x4825332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fe748932630 (unknown)
    @     0x7fe747e7d387 __GI_raise
    @     0x7fe747e7ea78 __GI_abort
    @          0x1c4b44f _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a7bb6 __cxxabiv1::__terminate()
    @          0x62a7c21 std::terminate()
    @          0x62a7d74 __cxa_throw
    @          0x1c4b356 _Znwm.cold
    @          0x1dc6370 std::vector<>::_M_default_append()
    @          0x1e44a9f starrocks::vectorized::BinaryColumnBase<>::append_selective()
    @          0x3be7ba3 starrocks::vectorized::NullableColumn::append_selective()
    @          0x3bcd46a starrocks::vectorized::Chunk::append_selective()
    @          0x3c31c19 starrocks::stream_load::NodeChannel::add_chunk()
    @          0x3c32a25 starrocks::stream_load::OlapTableSink::_send_chunk_by_node()
    @          0x3c34502 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x367f761 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x36801f7 starrocks::PlanFragmentExecutor::open()
    @          0x35eb98b starrocks::FragmentExecState::execute()
    @          0x35f0143 starrocks::FragmentMgr::exec_actual()
    @          0x35f0681 _ZNSt17_Function_handlerIFvvEZN9starrocks11FragmentMgr18exec_plan_fragmentERKNS1_23TExecPlanFragmentParamsERKSt8functionIFvPNS1_20PlanFragmentExecutorEEESC_EUlvE_E9_M_invokeERKSt9_Any_data
    @          0x3762a65 starrocks::ThreadPool::dispatch_thread()
    @          0x375df8a starrocks::Thread::supervise_thread()
    @     0x7fe74892aea5 start_thread
    @     0x7fe747f45b0d __clone
terminate called recursively
    @                0x0 (unknown)
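
be.out records the abort time only as a Unix timestamp, so it helps to convert it before comparing it with other logs. The sketch below assumes GNU date and that the hosts run on China Standard Time (UTC+8), as the CST in the start-time line suggests; abort_time_cst is a hypothetical helper name:

```shell
# Hypothetical helper: convert the "Aborted at <unix time>" value from be.out
# to wall-clock time in UTC+8. TZ=CST-8 is the POSIX spelling of UTC+8.
# Requires GNU date for the -d @<seconds> form.
abort_time_cst() {
  TZ=CST-8 date -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

abort_time_cst 1668648971   # prints 2022-11-17 09:36:11
```

Under that assumption the abort happened at 09:36:11 on the 17th, which lines up with the LastHeartbeat values in the show backends output.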

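Grep for 70490f38-4a82-3e6e-1a5b-9c31a60fca80 in fe.audit.log on the FE to see which SQL statement it was, and post it here.

Did your BE go down on the 17th?

Yes. To be exact, it went down at around 09:35 on the 17th.

I couldn't find anything related.

Is the BE process still running? The be.out file you posted does not seem to contain a fresh stack trace. When you run show backends on the FE leader node, does Alive also show false? What does the BE warning log print around 11-17 09:35?

I just checked: all three BE processes are gone. I have already restarted the cluster.

The show backends output is pasted in my first post.

By the BE warning log, do you mean the be.WARNING file? That is also pasted in my first post.

How much memory do the machines have? Can the FE and BE communicate normally?

One of the servers hosts both an FE and a BE, so communication between those two is certainly fine. The BEs on the other two servers have no network problems either. Everything was normal before the import started, and the problem only appeared on the third file, after two files had been imported.
I have also now found that the cluster will not start. I am still looking into what is going on.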

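The cluster later started normally. I ran another series of consecutive imports, 4.4 GB per file, and this time the BE nodes became unreachable while importing the fifth file, although the BE processes were still running. The relevant BE logs are listed below. Could you help figure out the cause?

The be.out content on the BE nodes is as follows: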

BE1

start time: 2022年 11月 17日 星期四 13:47:12 CST
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1668667334 (unix time) try "date -d @1668667334" if you are using GNU date ***
PC: @     0x7f78e8c2d387 __GI_raise
*** SIGABRT (@0x3ed000094bb) received by PID 38075 (TID 0x7f77a8bf6700) from PID 38075; stack trace: ***
    @          0x4825332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f78e96e2630 (unknown)
    @     0x7f78e8c2d387 __GI_raise
    @     0x7f78e8c2ea78 __GI_abort
    @          0x1c4b44f _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a7bb6 __cxxabiv1::__terminate()
    @          0x62a7c21 std::terminate()
    @          0x62a7d74 __cxa_throw
    @          0x1c4b356 _Znwm.cold
    @          0x3745719 starrocks::faststring::GrowArray()
    @          0x3358a0e starrocks::TypeEncodingTraits<>::create_page_builder()
    @          0x333ba99 std::_Function_handler<>::_M_invoke()
    @          0x357d4ce starrocks::IndexedColumnWriter::init()
    @          0x33b1d21 starrocks::ZoneMapIndexWriterImpl<>::finish()
    @          0x3333c78 starrocks::StringColumnWriter::write_zone_map()
    @          0x2fadd05 starrocks::SegmentWriter::finalize_columns()
    @          0x3596599 starrocks::VerticalBetaRowsetWriter::flush_columns()
    @          0x3088a53 starrocks::vectorized::RowsetMergerImpl<>::_do_merge_horizontally()
    @          0x3088f7a starrocks::vectorized::RowsetMergerImpl<>::_do_merge_vertically()
    @          0x308ae43 starrocks::vectorized::RowsetMergerImpl<>::do_merge()
    @          0x307e76c starrocks::vectorized::compaction_merge_rowsets()
    @          0x2f5f6fc starrocks::TabletUpdates::_do_compaction()
    @          0x2f606ec starrocks::TabletUpdates::compaction()
    @          0x2ec6ce9 starrocks::StorageEngine::_perform_update_compaction()
    @          0x30de20e starrocks::StorageEngine::_update_compaction_thread_callback()
    @          0x6321d20 execute_native_thread_routine
    @     0x7f78e96daea5 start_thread
    @     0x7f78e8cf5b0d __clone
    @                0x0 (unknown)

BE2

start time: 2022年 11月 17日 星期四 13:47:14 CST
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1668667217 (unix time) try "date -d @1668667217" if you are using GNU date ***
PC: @     0x7f47cba0a387 __GI_raise
*** SIGABRT (@0x3ee0000825a) received by PID 33370 (TID 0x7f46731f6700) from PID 33370; stack trace: ***
    @          0x4825332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f47cc4bf630 (unknown)
    @     0x7f47cba0a387 __GI_raise
    @     0x7f47cba0ba78 __GI_abort
    @          0x1c4b44f _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a7bb6 __cxxabiv1::__terminate()
    @          0x62a7c21 std::terminate()
    @          0x62a7d74 __cxa_throw
    @          0x1c4b356 _Znwm.cold
    @          0x4817f31 google::LogMessage::Init()
    @          0x48185f1 google::LogMessage::LogMessage()
    @          0x494e7f2 brpc::InputMessenger::OnNewMessages()
    @          0x49f554e brpc::Socket::ProcessEvent()
    @          0x49035ff bthread::TaskGroup::task_runner()
    @          0x4a8bcc1 bthread_make_fcontext

BE3

start time: 2022年 11月 17日 星期四 13:47:15 CST
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1668667186 (unix time) try "date -d @1668667186" if you are using GNU date ***
PC: @     0x7f4b83cd4387 __GI_raise
*** SIGABRT (@0x3ed000067cd) received by PID 26573 (TID 0x7f4a10b47700) from PID 26573; stack trace: ***
    @          0x4825332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f4b84789630 (unknown)
    @     0x7f4b83cd4387 __GI_raise
    @     0x7f4b83cd5a78 __GI_abort
    @          0x1c4b44f _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x62a7bb6 __cxxabiv1::__terminate()
    @          0x62a7c21 std::terminate()
    @          0x62a7d74 __cxa_throw
    @          0x1c4b356 _Znwm.cold
    @          0x3baa5f9 starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98 evhttp_read_body
    @          0x48b1273 bufferevent_readcb
    @          0x489d852 event_process_active_single_queue
    @          0x489df8f event_base_loop
    @          0x3b8d474 _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20 execute_native_thread_routine
    @     0x7f4b84781ea5 start_thread
    @     0x7f4b83d9cb0d __clone
    @                0x0 (unknown)

The be.WARNING files contain the following:

BE1

W1117 13:47:16.737025 38926 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.737251 38926 task_worker_pool.cpp:1069] Fail to report task to 172.18.22.221:9320, err=-1
W1117 13:47:16.737488 38929 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.737519 38929 task_worker_pool.cpp:1207] Fail to report workgroup to 172.18.22.221:9320, err=-1
W1117 13:47:16.737761 38927 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.737810 38927 task_worker_pool.cpp:1124] Fail to report disk state to 172.18.22.221:9320, err=-1
W1117 13:47:16.737998 38928 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.738018 38928 task_worker_pool.cpp:1171] Fail to report olap table state to 172.18.22.221:9320, err=-1
W1117 13:47:16.738099 38927 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.738111 38927 task_worker_pool.cpp:1124] Fail to report disk state to 172.18.22.221:9320, err=-1
W1117 13:47:16.738219 38928 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.738229 38928 task_worker_pool.cpp:1171] Fail to report olap table state to 172.18.22.221:9320, err=-1
W1117 14:33:50.214514 39068 mem_hook.cpp:255] large memory alloc: 2147483648 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f78e96daea5  start_thread
    @     0x7f78e8cf5b0d  __clone
    @              (nil)  (unknown)
W1117 14:34:02.601613 39068 mem_hook.cpp:255] large memory alloc: 4294967296 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f78e96daea5  start_thread
    @     0x7f78e8cf5b0d  __clone
    @              (nil)  (unknown)
W1117 14:34:27.811841 39068 mem_hook.cpp:255] large memory alloc: 8589934592 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f78e96daea5  start_thread
    @     0x7f78e8cf5b0d  __clone
    @              (nil)  (unknown)
W1117 14:37:21.360016 38953 fragment_context.cpp:19] [Driver] Canceled, query_id=49aafacf-6642-11ed-af55-5254004a05f5, instance_id=49aafacf-6642-11ed-af55-5254004a05f8, reason=Cancelled: LimitReach

BE2

E1117 11:01:53.853173 15565 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:02.856751 15570 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:05.858712 15641 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:17.860421 15565 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:26.861660 15621 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:29.864372 15646 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:32.864802 15570 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:35.865221 15633 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:41.865881 15646 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:44.866266 15633 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:50.867048 15621 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:53.867386 15646 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:02:56.867825 15570 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:11.869614 15570 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:14.869956 15621 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:29.871773 15565 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:38.872864 15641 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:41.873261 15621 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:50.874250 15621 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:03:53.874636 15646 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:04:08.876492 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:04:17.877595 15641 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:04:23.878315 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:04:59.882854 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:05:08.883993 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:05:20.885602 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:05:35.887687 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:05:38.887996 15633 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
E1117 11:05:50.889395 15649 stack.cpp:96] Fail to mmap size=1052672 stack_count=171, possibly limited by /proc/sys/vm/max_map_count: Cannot allocate memory [12]
W1117 13:46:31.412631 29482 utils.cpp:98] master client, retry finishTask: No more data to read.
W1117 13:46:31.413055 29482 utils.cpp:102] Fail to get master client from cache. host=172.18.22.221, port=9320, code=THRIFT_RPC_ERROR
W1117 13:46:31.413087 29482 task_worker_pool.cpp:1207] Fail to report workgroup to 172.18.22.221:9320, err=-1
W1117 13:47:16.627445 34217 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.627594 34217 task_worker_pool.cpp:1069] Fail to report task to 172.18.22.221:9320, err=-1
W1117 13:47:16.627930 34220 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.627954 34220 task_worker_pool.cpp:1207] Fail to report workgroup to 172.18.22.221:9320, err=-1
W1117 13:47:16.628441 34218 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.628531 34218 task_worker_pool.cpp:1124] Fail to report disk state to 172.18.22.221:9320, err=-1
W1117 13:47:16.628866 34219 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.628895 34219 task_worker_pool.cpp:1171] Fail to report olap table state to 172.18.22.221:9320, err=-1
W1117 13:47:16.629189 34218 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.629212 34218 task_worker_pool.cpp:1124] Fail to report disk state to 172.18.22.221:9320, err=-1
W1117 13:47:16.629483 34219 utils.cpp:85] Fail to get master client from cache. host=172.18.22.221 port=9320 code=THRIFT_RPC_ERROR
W1117 13:47:16.629519 34219 task_worker_pool.cpp:1171] Fail to report olap table state to 172.18.22.221:9320, err=-1
W1117 14:28:56.367556 34363 mem_hook.cpp:255] large memory alloc: 2147483648 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:29:14.921845 34363 mem_hook.cpp:255] large memory alloc: 4294967296 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:29:50.245283 34363 mem_hook.cpp:255] large memory alloc: 8589934592 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:36:20.788296 34367 mem_hook.cpp:255] large memory alloc: 2147483648 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:36:38.511837 34367 mem_hook.cpp:255] large memory alloc: 4294967296 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:37:14.707175 34367 mem_hook.cpp:255] large memory alloc: 8589934592 bytes, stack:
    @          0x368dbeb  malloc
    @          0x62a8105  operator new()
    @          0x3baa5f9  starrocks::StreamLoadAction::on_chunk_data()
    @          0x48adb98  evhttp_read_body
    @          0x48b1273  bufferevent_readcb
    @          0x489d852  event_process_active_single_queue
    @          0x489df8f  event_base_loop
    @          0x3b8d474  _ZZN9starrocks12EvHttpServer5startEvENKUlvE_clEv
    @          0x6321d20  execute_native_thread_routine
    @     0x7f47cc4b7ea5  start_thread
    @     0x7f47cbad2b0d  __clone
    @              (nil)  (unknown)
W1117 14:37:21.361085 34320 fragment_context.cpp:19] [Driver] Canceled, query_id=49aafacf-6642-11ed-af55-5254004a05f5, instance_id=49aafacf-6642-11ed-af55-5254004a05f6, reason=Cancelled: LimitReach
W1117 14:40:17.375574 34299 input_messenger.cpp:214] Fail to read from Socket{id=1408 fd=301 addr=172.18.22.221:46772:8360} (0x7f462b3cdfc0): Cannot allocate memory [12]

Two servers are set to 65535 and one to 131072. I will set them all to 131072 and try again.

This is caused by insufficient memory. What is ulimit -n set to on your system?

Setting vm.overcommit_memory to 1 solved the problem.

echo "vm.overcommit_memory=1" >> /etc/sysctl.conf
sysctl -p
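
The thread touches three kernel and shell limits: the open-file limit (ulimit -n), vm.max_map_count (named in the BE2 mmap errors) and vm.overcommit_memory. A small sketch for reading all three on each host before and after a change; show_limits is a hypothetical name, and the /proc paths are the standard Linux locations:

```shell
# Hypothetical helper: print the current values of the limits discussed here.
# Falls back to "unavailable" where a /proc entry cannot be read.
show_limits() {
  echo "open files (ulimit -n): $(ulimit -n)"
  echo "vm.max_map_count: $(cat /proc/sys/vm/max_map_count 2>/dev/null || echo unavailable)"
  echo "vm.overcommit_memory: $(cat /proc/sys/vm/overcommit_memory 2>/dev/null || echo unavailable)"
}

show_limits
```

A value of 1 for vm.overcommit_memory means the kernel always overcommits, that is, it does not refuse an allocation request because of its size.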