[Stability] All BE nodes in a running cluster suddenly crashed

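[Details] While the cluster was running normally, all BE nodes went down at the same instant. About 5 minutes after they were automatically restarted, most BE nodes went down again at once; after a second restart, the cluster returned to normal.
Grafana monitoring shows that BE CPU and memory usage were not high.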




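[Background] No special operations; some data warehouse model SQL read/write jobs were running normally.
[Business impact] The whole cluster was unavailable and data warehouse model computations failed.
[Shared-data (storage-compute separation)?] No, shared-nothing (storage and compute coupled)
[StarRocks version] 3.1.5
[Cluster size] 2 FE (1 follower) + 10 BE (deployed separately); both FE and BE are containerized on k8s
[Machine specs] FE: 12 cores, 24 GB; BE: 16 cores, 128 GB
[Contact] StarRocks community group 14, CrazyRen
[Attachments]
The BE logs show the following exception at both crashes: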

I0911 14:28:30.211582 659 local_tablets_channel.cpp:569] LocalTabletsChannel txn_id: 119951089 load_id: ce4c79ce-8d14-f28c-f51d-ae31dc653eab open 2 delta writer: [8227135:1][8227126:1] 0 failed_tablets: _num_remaining_senders: 1
I0911 14:28:30.226069 679 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=20702673 highest_score=8277215305
I0911 14:28:30.226161 679 tablet_updates.cpp:2401] update compaction start tablet:20702673 version:336450 score:8277214720 pick:31/valid:31/all:43 82245,82246,82247,82248,82249,82250,82251,82252,82253,82254,82255,82256,82257,82258,82259,82260,82261,82262,82263,82264,82265,82266,82267,82268,82269,82270,82271,82272,82273,82274,82275 #segments:31 #rows:239505->239401 bytes:42.48 MB->42.43 MB(estimate)
I0911 14:28:30.556550 659 local_tablets_channel.cpp:444] LocalTabletsChannel txn_id: 119951089 load_id: ce4c79ce-8d14-f28c-f51d-ae31dc653eab commit 2 tablets: 8227135,8227126
I0911 14:28:30.583408 302 txn_manager.cpp:298] Commit txn successfully. tablet: 8227126, txn_id: 119951089, rowsetid: 020000000000bd30f748252be898deaf604d45d29c1e8b87 #segment:1 #delfile:0 #uptfiles:0
I0911 14:28:30.589504 786 txn_manager.cpp:298] Commit txn successfully. tablet: 8227135, txn_id: 119951089, rowsetid: 020000000000bd31f748252be898deaf604d45d29c1e8b87 #segment:1 #delfile:0 #uptfiles:0
I0911 14:28:30.594940 436 fragment_executor.cpp:173] Prepare(): query_id=0fea4ce4-7007-11ef-9032-02550afe42b2 fragment_instance_id=0fea4ce4-7007-11ef-9032-02550afe42b4 is_stream_pipeline=0 backend_num=2
I0911 14:28:30.596012 703 local_tablets_channel.cpp:569] LocalTabletsChannel txn_id: 119951090 load_id: 854838a9-b7c9-91c8-338c-9ef477cc5d90 open 79 delta writer: [17514686:2][15982281:1][19731820:2][15982272:2][15982263:1][15982212:2][15982209:2][15982200:2][15982164:2][15982128:1][15982125:2][15982113:2][15982092:1][15982074:1][15982119:2][15981738:2][15981561:1][15981720:1][15981705:1][15981528:2][15981681:2][15981858:1][15982035:2][15982014:2][15981642:2][15981465:2][15981966:2][15981462:1][15981639:2][15981612:1][15981993:2][19197393:2][15981753:2][15981930:1][15981579:1][15981741:2][15981480:1][15981597:2][15982155:1][15981774:2][15981750:1][15981663:2][15981840:1][15981489:2][15981702:2][15981615:2][15982173:1][15981792:2][15981609:1][15981522:1][15981600:2][15981777:2][15981954:2][15981534:2][15982065:2][15981624:1][15981537:2][15981570:2][15981762:2][15981939:2][15981795:2][15981807:2][15981813:2][15981822:1][15981867:2][15982254:1][15981873:2][15981894:1][15981912:2][15981936:2][15981945:2][15981957:1][15981975:1][15982002:2][15982032:2][15981660:1][15982041:1][22889196:1][15982047:1] 0 failed_tablets: _num_remaining_senders: 1
W0911 14:28:30.577275 483 mem_hook.cpp:266] large memory alloc: 1897042944 bytes, stack:
@ 0x65b4642 malloc
@ 0xa0dd7bc operator new()
@ 0x3e698f4 starrocks::JoinHashTable::build()
@ 0x4312eb6 starrocks::HashJoinBuilder::build()
@ 0x430b57e starrocks::HashJoiner::build_ht()
@ 0x4306cfa starrocks::pipeline::HashJoinBuildOperator::set_finishing()
@ 0x431b22e starrocks::pipeline::SpillableHashJoinBuildOperator::set_finishing()
@ 0x3b7912e starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
@ 0x3b7b801 starrocks::pipeline::PipelineDriver::process()
@ 0x63a9e30 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x673e39b starrocks::ThreadPool::dispatch_thread()
@ 0x67385ea starrocks::Thread::supervise_thread()
@ 0x7fccfd875ac3 (unknown)
@ 0x7fccfd906bf4 clone
@ (nil) (unknown)
I0911 14:28:31.310446 679 rowset_merger.cpp:252] compaction merge finished. tablet=20702673 #key=1 algorithm=VERTICAL_COMPACTION column_group_size=44 input(entry=31 rows=239401 del=104 actual=239401 bytes=42.48 MB) output(rows=239401 chunk=134 bytes=40.27 MB) duration: 1084ms
I0911 14:28:31.316175 679 tablet_updates.cpp:1830] commit compaction tablet:20702673 version:336450.1 rowset:82276 #seg:1 #row:239401 size:40.27 MB #pending:0 state_memory:1.83 MB
I0911 14:28:31.316423 2738 tablet_updates.cpp:1859] apply_compaction_commit start tablet:20702673 version:336450.1 rowset:82276
I0911 14:28:31.319859 703 local_tablets_channel.cpp:569] LocalTabletsChannel txn_id: 119951091 load_id: ea4f12c5-61fc-e383-db7e-548c644cdaa0 open 2 delta writer: [5799961:2][5799946:1] 0 failed_tablets: _num_remaining_senders: 1
I0911 14:28:31.403012 2738 tablet_updates.cpp:2038] apply_compaction_commit finish tablet:20702673 version:336450.1 total del/row:2515690/42351439 5% rowset:82276 #row:239401 #del:0 #delvec:1 duration:86ms(0/85/1)
I0911 14:28:31.439132 679 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=16001219 highest_score=7813669224
I0911 14:28:31.439180 679 tablet_updates.cpp:2401] update compaction start tablet:16001219 version:74352 score:7813670400 pick:30/valid:30/all:30 19766,19767,19768,19769,19770,19771,19772,19773,19774,19775,19776,19777,19778,19779,19780,19781,19782,19783,19784,19785,19786,19787,19788,19789,19790,19791,19792,19793,19794,19795 #segments:30 #rows:2057078->2057025 bytes:230.22 MB->229.84 MB(estimate)
I0911 14:28:31.606048 790 task_worker_pool.cpp:182] Submit task success. type=PUBLISH_VERSION, signature=119951087, task_count_in_queue=1
I0911 14:28:31.606081 627 task_worker_pool.cpp:519] get publish version task txn_id: 119951087 priority queue size: 1
I0911 14:28:31.606285 620 tablet_updates.cpp:625] commit rowset tablet:5837419 version:403724 txn_id: 119951087 020000000000bd2cf748252be898deaf604d45d29c1e8b87 rowset:158437 #seg:1 #delfile:0 #uptfile:0 #row:28 size:65.29 KB #pending:0
I0911 14:28:31.606285 617 tablet_updates.cpp:625] commit rowset tablet:5837398 version:403724 txn_id: 119951087 020000000000bd2af748252be898deaf604d45d29c1e8b87 rowset:143893 #seg:1 #delfile:0 #uptfile:0 #row:33 size:69.75 KB #pending:0
I0911 14:28:31.606331 620 txn_manager.cpp:338] add txn info history. txn_id: 119951087, partition_id: 5837388, tablet_id: 5837419, schema_hash: 932111106, rowset_id: 020000000000bd2cf748252be898deaf604d45d29c1e8b87, version: 403724
I0911 14:28:31.606349 620 publish_version.cpp:125] Publish txn success tablet:5837419 version:403724 tablet_max_version:403724 partition:5837388 txn_id: 119951087 rowset:020000000000bd2cf748252be898deaf604d45d29c1e8b87
I0911 14:28:31.606285 619 tablet_updates.cpp:625] commit rowset tablet:5837416 version:403724 txn_id: 119951087 020000000000bd2bf748252be898deaf604d45d29c1e8b87 rowset:154309 #seg:1 #delfile:0 #uptfile:0 #row:30 size:65.32 KB #pending:0
I0911 14:28:31.606369 617 txn_manager.cpp:338] add txn info history. txn_id: 119951087, partition_id: 5837388, tablet_id: 5837398, schema_hash: 932111106, rowset_id: 020000000000bd2af748252be898deaf604d45d29c1e8b87, version: 403724
I0911 14:28:31.606374 617 publish_version.cpp:125] Publish txn success tablet:5837398 version:403724 tablet_max_version:403724 partition:5837388 txn_id: 119951087 rowset:020000000000bd2af748252be898deaf604d45d29c1e8b87
I0911 14:28:31.606396 619 txn_manager.cpp:338] add txn info history. txn_id: 119951087, partition_id: 5837388, tablet_id: 5837416, schema_hash: 932111106, rowset_id: 020000000000bd2bf748252be898deaf604d45d29c1e8b87, version: 403724
I0911 14:28:31.606407 619 publish_version.cpp:125] Publish txn success tablet:5837416 version:403724 tablet_max_version:403724 partition:5837388 txn_id: 119951087 rowset:020000000000bd2bf748252be898deaf604d45d29c1e8b87
I0911 14:28:31.606436 2738 tablet_updates.cpp:1610] primary index upsert tid: 5837419, cost: 65965, IOStat get_in_shard_cnt: 0 get_in_shard_cost: 0 read_io_bytes: 0 l0_write_cost: 15769 l1_l2_read_cost: 3984 flush_or_wal_cost: 26928 compaction_cost: 0 reload_meta_cost: 0
I0911 14:28:31.606456 627 publish_version.cpp:204] publish_version success. txn_id: 119951087 #partition:1 #tablet:3 time:1ms #already_finished:0
I0911 14:28:31.606633 2740 tablet_updates.cpp:1610] primary index upsert tid: 5837398, cost: 90380, IOStat get_in_shard_cnt: 0 get_in_shard_cost: 0 read_io_bytes: 0 l0_write_cost: 14074 l1_l2_read_cost: 238 flush_or_wal_cost: 22487 compaction_cost: 0 reload_meta_cost: 0
I0911 14:28:31.606699 2741 tablet_updates.cpp:1610] primary index upsert tid: 5837416, cost: 85394, IOStat get_in_shard_cnt: 0 get_in_shard_cost: 0 read_io_bytes: 0 l0_write_cost: 13649 l1_l2_read_cost: 214 flush_or_wal_cost: 41404 compaction_cost: 0 reload_meta_cost: 0
I0911 14:28:31.606946 2738 tablet_updates.cpp:1469] apply_rowset_commit finish. tablet:5837419 version:403724 txn_id: 119951087 total del/row:620/3357328 0% rowset:158437 #seg:1 #op(upsert:28 del:0) #del:567+28=595 #dv:3 duration:0ms(0/0/0/0)
I0911 14:28:31.606957 2740 tablet_updates.cpp:1469] apply_rowset_commit finish. tablet:5837398 version:403724 txn_id: 119951087 total del/row:827/3358164 0% rowset:143893 #seg:1 #op(upsert:33 del:0) #del:742+32=774 #dv:3 duration:0ms(0/0/0/0)
I0911 14:28:31.607014 2741 tablet_updates.cpp:1469] apply_rowset_commit finish. tablet:5837416 version:403724 txn_id: 119951087 total del/row:915/3357092 0% rowset:154309 #seg:1 #op(upsert:30 del:0) #del:818+30=848 #dv:2 duration:0ms(0/0/0/0)
I0911 14:28:31.608774 627 task_worker_pool.cpp:569] batch submit 1 finish publish version task txn publish task(s). #dir:0 flush:2ms
[Wed Sep 11 14:29:01 CST 2024] Last 50 lines of be.out ...
@ 0x67385ea starrocks::Thread::supervise_thread()
@ 0x7f9127e2fac3 (unknown)
@ 0x7f9127ec0bf4 clone
@ 0x0 (unknown)
start time: Wed Sep 11 14:22:55 CST 2024
3.1.5 RELEASE (build 5d8438a)
query_id:072b12ba-7007-11ef-8c93-02550afe42d7, fragment_instance:072b12ba-7007-11ef-8c93-02550afe444f
tracker:process consumption: 58516959724
tracker:query_pool consumption: 31597044645
tracker:load consumption: 342280
tracker:metadata consumption: 595889518
tracker:tablet_metadata consumption: 50746719
tracker:rowset_metadata consumption: 75333176
tracker:segment_metadata consumption: 45953215
tracker:column_metadata consumption: 423856408
tracker:tablet_schema consumption: 19028511
tracker:segment_zonemap consumption: 43640556
tracker:short_key_index consumption: 357180
tracker:column_zonemap_index consumption: 140892248
tracker:ordinal_index consumption: 127174392
tracker:bitmap_index consumption: 48800
tracker:bloom_filter_index consumption: 5568
tracker:compaction consumption: 44243296
tracker:schema_change consumption: 0
tracker:column_pool consumption: 745093025
tracker:page_cache consumption: 6554177408
tracker:update consumption: 12376237650
tracker:chunk_allocator consumption: 896535912
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1726036111 (unix time) try "date -d @1726036111" if you are using GNU date ***
PC: @ 0x7fccfd98ffcd (unknown)
*** SIGSEGV (@0x0) received by PID 25 (TID 0x7fcb539cb640) from PID 0; stack trace: ***
@ 0x7b00aaa google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fccfd823520 (unknown)
@ 0x7fccfd98ffcd (unknown)
@ 0x581b999 starrocks::JoinRuntimeFilter::serialize()
@ 0x42481ca starrocks::RuntimeBloomFilter<>::serialize()
@ 0x57fc4e1 starrocks::RuntimeFilterHelper::serialize_runtime_filter()
@ 0x6571f97 starrocks::RuntimeFilterPort::publish_runtime_filters()
@ 0x4307369 starrocks::pipeline::HashJoinBuildOperator::set_finishing()
@ 0x431b22e starrocks::pipeline::SpillableHashJoinBuildOperator::set_finishing()
@ 0x3b7912e starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
@ 0x3b7b801 starrocks::pipeline::PipelineDriver::process()
@ 0x63a9e30 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x673e39b starrocks::ThreadPool::dispatch_thread()
@ 0x67385ea starrocks::Thread::supervise_thread()
@ 0x7fccfd875ac3 (unknown)
@ 0x7fccfd906bf4 clone
@ 0x0 (unknown)

It looks like an illegal memory region was accessed while trying to allocate a large block of memory.
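
To make the raw numbers in the log above easier to read, here is a minimal Python sketch that converts the abort timestamp to local time and the byte counts to GiB. The helper names `to_cst` and `to_gib` are made up for illustration, and the UTC+8 offset is an assumption based on the `CST` label in be.out; the input values are copied from the log.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in be.out means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))
GIB = 1024 ** 3


def to_cst(unix_seconds: int) -> str:
    """Format a unix timestamp as local time in UTC+8."""
    return datetime.fromtimestamp(unix_seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(num_bytes / GIB, 2)


# Value from the "*** Aborted at ... (unix time)" line.
abort_time = to_cst(1726036111)

# Values from the "large memory alloc" warning and the tracker dump.
sizes_gib = {
    "large memory alloc": to_gib(1897042944),
    "process": to_gib(58516959724),
    "query_pool": to_gib(31597044645),
    "update": to_gib(12376237650),
    "page_cache": to_gib(6554177408),
}

print("aborted at:", abort_time)
for name, gib in sizes_gib.items():
    print(f"{name}: {gib} GiB")
```

The abort time comes out as 2024-09-11 14:28:31, which lines up with the last log entries before the crash. The process tracker is roughly 54.5 GiB, under half of the 128 GB BE spec, which is consistent with the Grafana observation that memory usage was not high.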

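This issue has already been fixed; we recommend upgrading to the latest 3.1 release. With multi-level nested joins, a table that is too large can push the GlobalRuntimeFilter past its size limit, which causes the crash. Temporary workaround: set global enable_global_runtime_filter = false;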

We have upgraded to 3.2.9 and have not seen the issue recur after 2 days of observation.
Many thanks!