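Occasional error: 【Resource temporarily unavailable】

To help us locate your problem faster, please provide the following information. Thank you.
【Details】Detailed description of the problem
The error "Resource temporarily unavailable" is reported occasionally.
【Background】What operations were performed?
Other tasks were running concurrently when the error occurred.
【Business impact】
【Shared-data (storage-compute separation) or not】
【StarRocks version】e.g. 3.0.6
【Cluster size】e.g. 3 FE + 5 BE
【Machine info】CPU virtual cores / memory / NIC, e.g. 48C/64G/10GbE
【Contact】So that we can reach you promptly for log information while resolving the problem, please add your contact details, e.g. community group 4, Xiao Li, or an email address. Thank you.
【Attachments】
Error message: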
W1111 04:30:28.175722 43204 mem_hook.cpp:266] large memory alloc: 1082472248 bytes, stack:
@ 0x5313403 malloc
@ 0x8dc4f85 operator new()
@ 0x329b36e std::vector<>::_M_range_insert<>()
@ 0x329eb04 starrocks::BinaryColumnBase<>::append()
@ 0x59d4b0a starrocks::NullableColumn::append()
@ 0x3398eb3 starrocks::JoinHashTable::append_chunk()
@ 0x37f6d2c starrocks::HashJoinBuilder::append_chunk()
@ 0x37f152c starrocks::HashJoiner::append_chunk_to_ht()
@ 0x36433a9 starrocks::pipeline::HashJoinBuildOperator::push_chunk()
@ 0x32d33da starrocks::pipeline::PipelineDriver::process()
@ 0x5acbae2 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x5411902 starrocks::ThreadPool::dispatch_thread()
@ 0x540c3fa starrocks::thread::supervise_thread()
@ 0x7fa8040c7ea5 start_thread
@ 0x7fa8036e2b2d __clone
@ (nil) (unknown)
be.WARNING (47.1 MB)

Check whether ulimit -n and ulimit -u on the system are greater than 65535.

ulimit -n
65535
ulimit -u
2060259

Check the limits of the BE process itself: cat /proc/$be_pid/limits
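To make that check repeatable across nodes, here is a minimal sketch that pulls the soft limit for one resource out of a limits file. The function names soft_limit and limit_ok are my own, and treating a value at or above the threshold as acceptable is an assumption, not something the StarRocks docs prescribe.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of StarRocks.

# Print the soft limit for a resource name such as "Max open files"
# from a file in /proc/<pid>/limits format.
soft_limit() {
    local limits_file="$1" name="$2"
    awk -v name="$name" 'index($0, name) == 1 {
        # Everything after the resource name: soft limit, hard limit, units.
        rest = substr($0, length(name) + 1)
        split(rest, fields, " ")
        print fields[1]
    }' "$limits_file"
}

# Succeed when the value is "unlimited" or at least the threshold
# (the ">=" comparison is an assumption).
limit_ok() {
    local value="$1" threshold="$2"
    [ "$value" = "unlimited" ] && return 0
    [ "$value" -ge "$threshold" ]
}

# Example usage (be_pid must be set to the BE process id):
#   nofile=$(soft_limit "/proc/$be_pid/limits" "Max open files")
#   limit_ok "$nofile" 65535 || echo "open files limit too low: $nofile"
```

Reading the limits file of the running process matters because ulimit in a login shell shows the shell's limits, which can differ from what the BE process inherited when it was started.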


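The parameter configuration is not reasonable. Refer to https://docs.starrocks.io/zh/docs/deployment/environment_configurations/ and adjust all of the system parameters.

Also, to make the change take effect for the BE process that is already running, you can use the following: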
sudo prlimit --pid=$be_pid --nproc=65535:65535

Changed. Will keep observing.

At 15:30 the problem reproduced again.

Have the ulimit settings been changed on all BE nodes? Also run: lsof -p $be_pid|wc -l

[root@HD-RTDW-DOSDBBE01 ~]# lsof -p 20609|wc -l
18039
[root@HD-RTDW-DOSDBBE02 ~]# lsof -p 21679|wc -l
18048
[root@HD-RTDW-DOSDBBE03 ~]# lsof -p 53709|wc -l
17745
[root@HD-RTDW-DOSDBBE04 ~]# lsof -p 42661|wc -l
18034
[root@HD-RTDW-DOSDBBE05 ~]# lsof -p 50732|wc -l
18036
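Note that lsof -p also lists entries that are not file descriptors (cwd, txt, mem mappings, plus a header line), so the counts above overstate descriptor usage. A rough sketch for counting only real descriptors and comparing them with the limit follows; fd_count and usage_percent are hypothetical helper names, and reading another user's /proc/<pid>/fd usually needs root.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Count entries in a directory such as /proc/<pid>/fd;
# each entry there corresponds to one open file descriptor.
fd_count() {
    local fd_dir="$1"
    ls -A "$fd_dir" | wc -l | tr -d ' '
}

# Integer percentage of the limit currently in use (rounded down).
usage_percent() {
    local used="$1" limit="$2"
    echo $(( used * 100 / limit ))
}

# Example usage (be_pid must be set to the BE process id):
#   used=$(fd_count "/proc/$be_pid/fd")
#   echo "fd usage: $used ($(usage_percent "$used" 65535)% of 65535)"
```

Even taking the lsof numbers at face value, about 18,000 entries is well under a 65535 limit, which suggests the open-file limit is not the bottleneck here.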

Please send be.out.

be(2).out (3.3 KB)

Sent.

This is caused by Spill being enabled. Let me confirm the exact trigger first.

*** Aborted at 1700718528 (unix time) try "date -d @1700718528" if you are using GNU date ***
PC: @          0x36106b2 _ZNSt17_Function_handlerIFN9starrocks6StatusEvEZZNS0_8pipeline34SpillablePartitionSortSinkOperator13set_finishingEPNS0_12RuntimeStateEENKUlS6_T_E0_clISt10shared_ptrINS0_5spill14IOTaskExecutorEEEEDaS6_S7_EUlvE_E9_M_invokeERKSt9_Any_data
*** SIGSEGV (@0x0) received by PID 114639 (TID 0x7fadddc59700) from PID 0; stack trace: ***
    @          0x6641b82 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7faeae4ec630 (unknown)
    @          0x36106b2 _ZNSt17_Function_handlerIFN9starrocks6StatusEvEZZNS0_8pipeline34SpillablePartitionSortSinkOperator13set_finishingEPNS0_12RuntimeStateEENKUlS6_T_E0_clISt10shared_ptrINS0_5spill14IOTaskExecutorEEEEDaS6_S7_EUlvE_E9_M_invokeERKSt9_Any_data
    @          0x35de057 _ZNSt17_Function_handlerIFN9starrocks6StatusEvEZNS0_5spill7Spiller23set_flush_all_call_backINS3_23ResourceMemTrackerGuardIJSt8weak_ptrINS0_8pipeline12QueryContextEEEEEEES1_RKSt8functionIS2_EPNS0_12RuntimeStateERNS3_14IOTaskExecutorERKT_EUlvE_E9_M_invokeERKSt9_Any_data
    @          0x3675d25 starrocks::spill::SpillerWriter::_decrease_running_flush_tasks()
    @          0x35d57cb _ZZZN9starrocks5spill16RawSpillerWriter5flushIRNS0_14IOTaskExecutorERNS0_23ResourceMemTrackerGuardIJSt8weak_ptrINS_8pipeline12QueryContextEEEEEEENS_6StatusEPNS_12RuntimeStateEOT_OT0_ENKUlvE0_clEvENKUlvE0_clEv
    @          0x35d5b55 _ZNSt17_Function_handlerIFvvEZN9starrocks5spill16RawSpillerWriter5flushIRNS2_14IOTaskExecutorERNS2_23ResourceMemTrackerGuardIJSt8weak_ptrINS1_8pipeline12QueryContextEEEEEEENS1_6StatusEPNS1_12RuntimeStateEOT_OT0_EUlvE0_E9_M_invokeERKSt9_Any_data
    @          0x5221fc0 starrocks::PriorityThreadPool::work_thread()
    @          0x6601547 thread_proxy
    @     0x7faeae4e4ea5 start_thread
    @     0x7faeadaffb2d __clone
    @                0x0 (unknown)

Search the BE warning log (be.WARNING) for this message: Resource temporarily unavailable
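When searching, it can help to group the failures by remote endpoint to see whether a single BE is the common target. The sketch below assumes the "brpc failed, error=... @ip:port" line format that appears in the logs posted in this thread; eagain_by_peer is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: assumes log lines of the form
#   ... brpc failed, error=Resource temporarily unavailable, error_text=[E11]... @ip:port

# Print "<count> @<ip>:<port>" per remote endpoint, most frequent first.
eagain_by_peer() {
    local log_file="$1"
    grep -F 'brpc failed, error=Resource temporarily unavailable' "$log_file" \
        | grep -oE '@[0-9.]+:[0-9]+' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}

# Example usage:
#   eagain_by_peer be.WARNING
```

If every failure points at the same address, that node is worth inspecting first; if they are spread evenly, the cause is more likely on the sending side or in the RPC layer.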

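Oh, sorry, I pasted the wrong thing; that was a different error. Let me find the right one. Spill is not enabled in production.

log.zip (90.0 KB) Re-uploaded the OUT logs from all 5 nodes.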

W1116 13:58:17.525676 52633 tablet_updates.cpp:1331] wait_for_version slow(4411ms) version:1580595.1 tablet:124201901 #version:5334 [1575303.1 1580595.1@5333 1580595.1] pending: rowsets:4[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1596945/1/1549479/0/238.07 MB/17.93 MB]
W1116 14:00:24.996192 52633 tablet_updates.cpp:1331] wait_for_version slow(6524ms) version:1580716.1 tablet:124201901 #version:1919 [1578813 1580716.1@1918 1580716.1] pending: rowsets:4[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1597067/1/1549480/0/238.06 MB/17.94 MB]
W1116 14:00:58.587306 51252 mem_hook.cpp:266] large memory alloc: 1606359846 bytes, stack:
@ 0x5313403 malloc
@ 0x8dc4f85 operator new()
@ 0x329b36e std::vector<>::_M_range_insert<>()
@ 0x329eb04 starrocks::BinaryColumnBase<>::append()
@ 0x59d4b0a starrocks::NullableColumn::append()
@ 0x3398eb3 starrocks::JoinHashTable::append_chunk()
@ 0x37f6d2c starrocks::HashJoinBuilder::append_chunk()
@ 0x37f152c starrocks::HashJoiner::append_chunk_to_ht()
@ 0x36433a9 starrocks::pipeline::HashJoinBuildOperator::push_chunk()
@ 0x32d33da starrocks::pipeline::PipelineDriver::process()
@ 0x5acbae2 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x5411902 starrocks::ThreadPool::dispatch_thread()
@ 0x540c3fa starrocks::thread::supervise_thread()
@ 0x7f26046d0ea5 start_thread
@ 0x7f2603cebb2d __clone
@ (nil) (unknown)
W1116 14:01:19.162811 51130 fragment_context.cpp:123] [Driver] Canceled, query_id=8d0e79b8-8445-11ee-974e-84160c1734a8, instance_id=8d0e79b8-8445-11ee-974e-84160c1734af, reason=InternalError
W1116 14:01:19.162855 51153 fragment_context.cpp:123] [Driver] Canceled, query_id=8d0e79b8-8445-11ee-974e-84160c1734a8, instance_id=8d0e79b8-8445-11ee-974e-84160c173540, reason=InternalError
W1116 14:01:19.162863 51161 fragment_context.cpp:123] [Driver] Canceled, query_id=8d0e79b8-8445-11ee-974e-84160c1734a8, instance_id=8d0e79b8-8445-11ee-974e-84160c1734fc, reason=InternalError
W1116 14:01:52.010386 52699 async_delta_writer.cpp:158] Fail to execution_queue_execute: 22
W1116 14:01:52.022840 52709 async_delta_writer.cpp:158] Fail to execution_queue_execute: 22
W1116 14:01:58.021355 52694 disposable_closure.h:50] brpc failed, error=Resource temporarily unavailable, error_text=[E11]Resource temporarily unavailable @172.28.200.245:8060
W1116 14:01:58.021441 52694 sink_buffer.cpp:372] transmit chunk rpc failed:a3b7deb4-8445-11ee-974e-84160c1734e1
W1116 14:01:58.054486 52698 disposable_closure.h:50] brpc failed, error=Resource temporarily unavailable, error_text=[E11]Resource temporarily unavailable @172.28.200.245:8060
W1116 14:01:58.054507 52698 sink_buffer.cpp:372] transmit chunk rpc failed:7154d9ab-8445-11ee-974e-84160c1735b2
W1116 14:01:58.247648 51154 fragment_context.cpp:123] [Driver] Canceled, query_id=a3b7deb4-8445-11ee-974e-84160c1734a8, instance_id=a3b7deb4-8445-11ee-974e-84160c1734f8, reason=InternalError
W1116 14:01:58.247730 51139 fragment_context.cpp:123] [Driver] Canceled, query_id=a3b7deb4-8445-11ee-974e-84160c1734a8, instance_id=a3b7deb4-8445-11ee-974e-84160c1734b4, reason=InternalError
W1116 14:01:58.247833 51159 fragment_context.cpp:123] [Driver] Canceled, query_id=a3b7deb4-8445-11ee-974e-84160c1734a8, instance_id=a3b7deb4-8445-11ee-974e-84160c1734ec, reason=InternalError
W1116 14:01:58.247836 51151 fragment_context.cpp:123] [Driver] Canceled, query_id=a3b7deb4-8445-11ee-974e-84160c1734a8, instance_id=a3b7deb4-8445-11ee-974e-84160c1734b0, reason=InternalError
W1116 14:01:58.247896 51119 fragment_context.cpp:123] [Driver] Canceled, query_id=a3b7deb4-8445-11ee-974e-84160c1734a8, instance_id=a3b7deb4-8445-11ee-974e-84160c1734e1, reason=InternalError
W1116 14:01:58.248296 51250 tablet_sink.cpp:1635] close channel failed. channel_name=NodeChannel[28714], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.248314 51250 tablet_sink.cpp:1635] close channel failed. channel_name=NodeChannel[28715], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.248317 51250 tablet_sink.cpp:1635] close channel failed. channel_name=NodeChannel[28713], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.248320 51250 tablet_sink.cpp:1635] close channel failed. channel_name=NodeChannel[28686], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.248334 51250 tablet_sink.cpp:1635] close channel failed. channel_name=NodeChannel[28725], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.263705 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28714], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.263734 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28715], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.263739 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28713], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.263747 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28686], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.263751 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28725], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.268877 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28714], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.268890 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28715], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.268893 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28713], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.268896 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28686], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:01:58.268908 51275 tablet_sink.cpp:1815] close channel failed. channel_name=NodeChannel[28725], load_info=load_id=a3b7deb4-8445-11ee-974e-84160c1734a8, txn_id: 339163411, parallel=1, compress_type=2, error_msg=Cancelled by pipeline engine
W1116 14:03:25.399272 52633 tablet_updates.cpp:1331] wait_for_version slow(4072ms) version:1580921.1 tablet:124201901 #version:2126 [1578813 1580921.1@2124 1580922] pending: rowsets:61[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1597217/1/1/0/5.04 KB/256.00 MB],[1597218/1/7/0/6.46 KB/255.99 MB],[1597219/1/1/0/4.95 KB/256.00 MB],[1597220/1/4/0/5.95 KB/255.99 MB],[1597221/1/1/0/4.84 KB/256.00 MB],[1597222/1/3/0/5.53 KB/255.99 MB],[1597223/1/3/0/5.42 KB/255.99 MB],[1597224/1/1/0/5.01 KB/256.00 MB]…,[1597266/0/0/0/0/256.00 MB],[1597267/0/0/0/0/256.00 MB],[1597268/1/5/0/5.80 KB/255.99 MB],[1597269/1/1/0/5.06 KB/256.00 MB],[1597270/1/3/0/5.44 KB/255.99 MB],[1597271/0/0/0/0/256.00 MB],[1597272/1/10/0/7.25 KB/255.99 MB],[1597273/1/1549501/0/238.11 MB/17.89 MB],[1597274/0/0/0/0/256.00 MB]
W1116 14:03:25.453961 52497 tablet_updates.cpp:1331] wait_for_version slow(4096ms) version:1580922 tablet:124201901 #version:2126 [1578813 1580922@2125 1580922] pending: rowsets:61[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1597217/1/1/0/5.04 KB/256.00 MB],[1597218/1/7/0/6.46 KB/255.99 MB],[1597219/1/1/0/4.95 KB/256.00 MB],[1597220/1/4/0/5.95 KB/255.99 MB],[1597221/1/1/0/4.84 KB/256.00 MB],[1597222/1/3/0/5.53 KB/255.99 MB],[1597223/1/3/0/5.42 KB/255.99 MB],[1597224/1/1/0/5.01 KB/256.00 MB]…,[1597266/0/0/0/0/256.00 MB],[1597267/0/0/0/0/256.00 MB],[1597268/1/5/0/5.80 KB/255.99 MB],[1597269/1/1/0/5.06 KB/256.00 MB],[1597270/1/3/0/5.44 KB/255.99 MB],[1597271/0/0/0/0/256.00 MB],[1597272/1/10/0/7.25 KB/255.99 MB],[1597273/1/1549501/124/238.11 MB/17.99 MB],[1597274/0/0/0/0/256.00 MB]
W1116 14:05:32.229876 52633 tablet_updates.cpp:1331] wait_for_version slow(3844ms) version:1581068.1 tablet:124201901 #version:2274 [1578813 1581068.1@2272 1581069] pending: rowsets:38[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1597388/1/3/0/5.59 KB/255.99 MB],[1597389/0/0/0/0/256.00 MB],[1597390/0/0/0/0/256.00 MB],[1597391/1/5/0/5.80 KB/255.99 MB],[1597392/1/1/0/5.06 KB/256.00 MB],[1597393/1/3/0/5.44 KB/255.99 MB],[1597394/1/1/0/4.93 KB/256.00 MB],[1597395/1/13/0/7.66 KB/255.99 MB]…,[1597414/1/1/0/5.08 KB/256.00 MB],[1597415/1/6/0/6.21 KB/255.99 MB],[1597416/1/2/0/5.41 KB/255.99 MB],[1597417/1/1/0/5.03 KB/256.00 MB],[1597418/0/0/0/0/256.00 MB],[1597419/1/1/0/5.02 KB/256.00 MB],[1597420/1/1/0/4.95 KB/256.00 MB],[1597421/1/1549509/0/238.12 MB/17.88 MB],[1597422/1/1/0/4.96 KB/256.00 MB]
W1116 14:05:32.293366 52497 tablet_updates.cpp:1331] wait_for_version slow(3664ms) version:1581069 tablet:124201901 #version:2274 [1578813 1581069@2273 1581069] pending: rowsets:38[id/seg/row/del/byte/compaction]: [364/2/7224876/9982/1.03 GB/-789.77 MB],[669/1/3358195/4959/496.63 MB/-236.96 MB],[1012/1/3288537/4618/482.72 MB/-223.33 MB],[1597388/1/3/0/5.59 KB/255.99 MB],[15:

This may be a defect in brpc. We are working with upstream on an optimization: https://github.com/apache/brpc/issues/2477