[Shared-nothing] StarRocks 3.3.0 deployed on k8s: BE crash

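To help us locate your problem faster, please provide the following information. Thank you.
[Details] Detailed description of the problem
[Background] What operations were performed?
[Business impact]
[Shared-data (storage-compute separation)?]
[StarRocks version] e.g. 3.3.0
[Cluster size] e.g. 3 FE (1 follower + 2 observers) + 3 BE (FE and BE co-located)
[Machine info] CPU vCores / memory / NIC, e.g. 16C/64G/10 GbE
[Contact] Community group 15, redscarf
[Attachments]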

  • be.INFO
I0816 19:12:18.226403  1042 compaction_task.cpp:39] start compaction. task_id:22416, tablet:2629792, algorithm:VERTICAL_COMPACTION, compaction_type:cumulative, compaction_score:59.0758, output_version:[54768-54785], input rowsets size:18
I0816 19:12:18.226452  1044 compaction_manager.cpp:87] submit task to compaction pool, task_id:22418, tablet_id:2629804, compaction_type:cumulative, compaction_score:57.258 for round:22554, candidates_size:0
E0816 19:12:18.226548  1041 threadpool.cpp:455] Thread pool failed to create thread: Runtime error: Could not create thread: Resource temporarily unavailable
I0816 19:12:18.226768 123893 compaction_task.cpp:39] start compaction. task_id:22417, tablet:2629798, algorithm:VERTICAL_COMPACTION, compaction_type:cumulative, compaction_score:66.522, output_version:[54768-54785], input rowsets size:18
I0816 19:12:18.226778  1040 size_tiered_compaction_policy.cpp:353] pick tablet 2629810 for size-tiered compaction rowset version=54769-54785 score=40 level_size=66604 total_size=1081409 segment_num=17 force_base_compaction=0 reached_max_versions=0
I0816 19:12:18.226843  1044 compaction_manager.cpp:87] submit task to compaction pool, task_id:22419, tablet_id:2629810, compaction_type:cumulative, compaction_score:71.6143 for round:22555, candidates_size:0
I0816 19:12:18.313393  1153 stream_load.cpp:243] new income streaming load request.id=e6480452682071e3-2374d5ca100363bc, job_id=-1, txn_id: -1, label=flinkx_connector_20240816_191210_9f6da15928414e73a54c57f522e0d4a7, db=clog, db=clog, tbl=containers_log
I0816 19:12:18.314041  1150 stream_load.cpp:243] new income streaming load request.id=2145c51841d732d1-b88453f4cf7fd8b4, job_id=-1, txn_id: -1, label=flinkx_connector_20240816_191210_ed58e0cd6e644bdfa2c096e574763c7a, db=clog, db=clog, tbl=containers_log
I0816 19:12:18.314802  1078 local_tablets_channel.cpp:711] LocalTabletsChannel txn_id: 1564623 load_id: 88401629-4aeb-2747-3fd6-8b5343d278b1 open 8 delta writers, 0 failed_tablets:  _num_remaining_senders: 1
I0816 19:12:18.317302  1152 stream_load.cpp:243] new income streaming load request.id=e1411c820c4bc9f7-7fe8e3219ff36f9a, job_id=-1, txn_id: -1, label=flinkx_connector_20240816_191211_19163d60242c48afaebb16a15686bb0a, db=clog, db=clog, tbl=containers_log
I0816 19:12:18.318394  1153 stream_load_executor.cpp:77] begin to execute job. label=flinkx_connector_20240816_191210_9f6da15928414e73a54c57f522e0d4a7, txn_id: 1564625, query_id=e6480452-6820-71e3-2374-d5ca100363bc
I0816 19:12:18.318485  1153 plan_fragment_executor.cpp:83] Prepare(): query_id=e6480452-6820-71e3-2374-d5ca100363bc fragment_instance_id=e6480452-6820-71e3-2374-d5ca100363bd backend_num=0
I0816 19:12:18.319118  1090 local_tablets_channel.cpp:711] LocalTabletsChannel txn_id: 1564624 load_id: 9c4ed4a8-dfc8-fa9b-893a-cf58dbee709c open 8 delta writers, 0 failed_tablets:  _num_remaining_senders: 1
I0816 19:12:18.319233  1150 stream_load_executor.cpp:77] begin to execute job. label=flinkx_connector_20240816_191210_ed58e0cd6e644bdfa2c096e574763c7a, txn_id: 1564626, query_id=2145c518-41d7-32d1-b884-53f4cf7fd8b4
I0816 19:12:18.319335  1150 plan_fragment_executor.cpp:83] Prepare(): query_id=2145c518-41d7-32d1-b884-53f4cf7fd8b4 fragment_instance_id=2145c518-41d7-32d1-b884-53f4cf7fd8b5 backend_num=0
I0816 19:12:18.319592   747 plan_fragment_executor.cpp:192] Open(): fragment_instance_id=e6480452-6820-71e3-2374-d5ca100363bd
I0816 19:12:18.320350   749 plan_fragment_executor.cpp:192] Open(): fragment_instance_id=2145c518-41d7-32d1-b884-53f4cf7fd8b5
I0816 19:12:18.320616  1100 local_tablets_channel.cpp:711] LocalTabletsChannel txn_id: 1564625 load_id: e6480452-6820-71e3-2374-d5ca100363bc open 8 delta writers, 0 failed_tablets:  _num_remaining_senders: 1
I0816 19:12:18.321417  1105 local_tablets_channel.cpp:711] LocalTabletsChannel txn_id: 1564626 load_id: 2145c518-41d7-32d1-b884-53f4cf7fd8b4 open 8 delta writers, 0 failed_tablets:  _num_remaining_senders: 1
E0816 19:12:18.321874   747 threadpool.cpp:455] Thread pool failed to create thread: Runtime error: Could not create thread: Resource temporarily unavailable
E0816 19:12:18.322134   749 threadpool.cpp:455] Thread pool failed to create thread: Runtime error: Could not create thread: Resource temporarily unavailable
  • be crash
    • be.out
start time: Sun Aug 18 22:46:25 CST 2024, server uptime:  22:46:25 up 158 days,  8:08,  0 users,  load average: 154.89, 52.10, 22.27
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
3.3.0 RELEASE (build 19a3f66)
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
tracker:process consumption: 2561198336
tracker:query_pool consumption: 0
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 30574239
tracker:tablet_metadata consumption: 10644196
tracker:rowset_metadata consumption: 11687850
tracker:segment_metadata consumption: 1472779
tracker:column_metadata consumption: 6769414
tracker:tablet_schema consumption: 544252
tracker:segment_zonemap consumption: 546681
tracker:short_key_index consumption: 804869
tracker:column_zonemap_index consumption: 1172670
tracker:ordinal_index consumption: 1792736
tracker:bitmap_index consumption: 15600
tracker:bloom_filter_index consumption: 30840
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 883585312
tracker:jit_cache consumption: 7624
tracker:update consumption: 2847660
tracker:chunk_allocator consumption: 43188984
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1724039747 (unix time) try "date -d @1724039747" if you are using GNU date ***
PC: @     0x7f22800979fc pthread_kill
*** SIGABRT (@0x18) received by PID 24 (TID 0x7f2156683640) from PID 24; stack trace: ***
    @          0x990848a google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f2280043520 (unknown)
    @     0x7f22800979fc pthread_kill
    @     0x7f2280043476 raise
    @     0x7f22800297f3 abort
    @          0xe787763 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0xe785e1c __cxxabiv1::__terminate()
    @          0xe785e87 std::terminate()
    @          0xe785fe9 __cxa_throw
    @          0xe820cc8 std::__throw_system_error()
    @          0xe820f3d std::thread::_M_start_thread()
    @          0x98e8c6f apache::thrift::server::TThreadedServer::onClientConnected()
    @          0x98e45d3 apache::thrift::server::TServerFramework::serve()
    @          0x98e9432 apache::thrift::server::TThreadedServer::serve()
    @          0x8309ac3 starrocks::ThriftServer::ThriftServerEventProcessor::supervise()
    @          0xe820e54 execute_native_thread_routine
    @     0x7f2280095ac3 (unknown)
    @     0x7f2280126a04 clone
    @                0x0 (unknown)
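
The be.INFO excerpt above shows repeated `Thread pool failed to create thread` errors shortly before the crash. Below is a minimal sketch, assuming glog-style lines like the ones quoted, that counts those errors per second so the onset can be lined up with the restart times. The function name `thread_creation_failures` is hypothetical and not part of StarRocks.

```python
import re
from collections import Counter

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2})\.\d+\s+(\d+) ([^:\]]+):(\d+)\] (.*)$"
)


def thread_creation_failures(lines):
    """Count 'Could not create thread' error lines per (MMDD, HH:MM:SS)."""
    per_second = Counter()
    for line in lines:
        match = GLOG_LINE.match(line)
        if not match:
            continue  # not a glog line (e.g. a stack frame from be.out)
        severity, day, second = match.group(1), match.group(2), match.group(3)
        if severity == "E" and "Could not create thread" in match.group(7):
            per_second[(day, second)] += 1
    return per_second
```

For example, `thread_creation_failures(open("be.INFO", errors="replace"))` would return a `Counter` keyed by date and second; the file path here is an assumption about where the log lives.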



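All 3 BEs have restarted to varying degrees, and the number of thrift_server threads keeps growing.

Summary:
1. Symptom 1: once the number of thrift_server threads exceeds 3,000 (after roughly 12 hours), the BE restarts.
2. Symptom 2: once the BE's total thread count reaches about 30,000, the BE restarts.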
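
One way to track the two thread counts in the summary is to group a process's threads by name. The following is a minimal sketch, assuming a Linux `/proc` filesystem where each thread's name is readable from `/proc/<pid>/task/<tid>/comm`; the helper names `read_thread_names` and `count_by_name` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_thread_names(pid, proc_root="/proc"):
    """Return the name of every live thread of `pid` (Linux only)."""
    names = []
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            names.append((task / "comm").read_text().strip())
        except OSError:
            # The thread exited between listing the directory and reading it.
            continue
    return names


def count_by_name(names):
    """Return (thread name, count) pairs, most frequent first."""
    return Counter(names).most_common()
```

Running `count_by_name(read_thread_names(pid))` periodically inside the BE container, with `pid` set to the BE process ID, would show whether `thrift_server` is the name that dominates as the total climbs.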

pstack output
pstack-10002.txt (40.9 KB)
pstack-10215.txt (41.8 KB)

Reference: [Troubleshooting] BE Crash - StarRocks Best Practices / Troubleshooting - StarRocks Chinese Community Forum (mirrorship.cn). The results of the checks are as follows:



I searched "Common Crash / BUG / Optimization Lookup - StarRocks User Q&A - StarRocks Chinese Community Forum (mirrorship.cn)" and have not found the same error so far.

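Please send me the FE leader log and the BE INFO log from that time.

OK, one moment.

There were more restarts on August 20, so I took the logs from August 20 directly.

be.log.zip (36.8 MB)

fe-0820.log.zip (14.9 MB)

Have you found what is causing this?

Hold on, I'll take a look today.

OK, thanks.

Please post the output of cat /proc/<PID>/limits.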
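
"Resource temporarily unavailable" is the message for EAGAIN, which thread creation returns when a thread limit is hit, so the "Max processes" row of that file is the one of interest. Below is a minimal sketch, assuming the usual column layout of `/proc/<pid>/limits` (columns separated by runs of two or more spaces); `parse_limits` is a hypothetical helper. On k8s, other limits (for example the cgroup `pids.max` or the kernel-wide `threads-max`) may also apply and are not covered by this file.

```python
import re


def parse_limits(text):
    """Parse /proc/<pid>/limits text into {limit name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # blank line or the header row
        limits[fields[0]] = (fields[1], fields[2])
    return limits
```

With the file contents in `text`, `parse_limits(text).get("Max processes")` would give the soft and hard values as strings (either a number or `unlimited`).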

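Port 9060 isn't exposed to the public network, is it?

No, it isn't exposed. We access everything through a Service.

Are there logs from the other FEs around 2024-08-20 12:50:58?

Judging from the BE log, it looks like the FE opened a large number of connections to this BE. Each connection creates one thread, which leads to a large number of threads being created. What caused so many connections to be created is still unclear.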
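
To illustrate why unclosed connections translate directly into live threads under a thread-per-connection model, here is a toy simulation. It is not Thrift code and makes no claim about how `TThreadedServer` is implemented; the class and function names are hypothetical, and a "connection" is modeled as an event that is set when the client closes it.

```python
import threading


class ThreadPerConnectionServer:
    """Toy model: every accepted connection gets its own worker thread."""

    def __init__(self):
        self._workers = []

    def on_client_connected(self, closed):
        # The worker blocks until the client closes the connection.
        worker = threading.Thread(target=closed.wait, daemon=True)
        worker.start()
        self._workers.append(worker)

    def live_threads(self):
        return sum(1 for worker in self._workers if worker.is_alive())


def simulate(connections, closed_by_client):
    """Open `connections`, close the first `closed_by_client`, count live workers."""
    server = ThreadPerConnectionServer()
    events = [threading.Event() for _ in range(connections)]
    for event in events:
        server.on_client_connected(event)
    for event in events[:closed_by_client]:
        event.set()
    for worker in server._workers[:closed_by_client]:
        worker.join()  # wait for the workers of closed connections to exit
    return server.live_threads()
```

In this model the number of live threads equals the number of connections that were opened and never closed, which is why a client that keeps opening connections without releasing them would eventually exhaust the thread limit.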

The log I sent is the master's complete log for that day. Do you need the follower logs as well?

Do you have monitoring for the two metrics thrift_current_connections and thrift_connections_total?
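
If those two metrics are not on a dashboard, they can be read from a metrics dump. Below is a minimal sketch, assuming the BE exposes Prometheus-style text lines of the form `name{labels} value` with no trailing timestamp; where the text comes from (for example an HTTP metrics endpoint on the BE) and the exact metric prefix and labels are assumptions, so the helper matches on the two names as substrings. `thrift_connection_metrics` is a hypothetical name.

```python
WANTED = ("thrift_current_connections", "thrift_connections_total")


def thrift_connection_metrics(metrics_text):
    """Return {metric name with labels: value} for the two thrift metrics."""
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        if any(wanted in name_and_labels for wanted in WANTED):
            found[name_and_labels] = float(value)
    return found
```

Comparing the current-connections value against the thrift_server thread count would show whether the two grow together.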

Yes, the follower logs are needed.

But judging from the monitoring, there don't seem to be that many connections.

OK, one moment.

fe-0820-0.log.zip (8.4 MB)

fe-0820-1.log.zip (5.6 MB)

I'll look into it.