BE keeps crashing

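To help us locate your problem faster, please provide the following information. Thanks.
[Details] Detailed description of the problem
[Background] What operations were performed?
[Business impact]
[Shared-data (storage-compute separation)?] No
[StarRocks version] e.g. 3.1.4
[Cluster size] e.g. 3 FE (3 followers) + 8 BE (FE and BE co-deployed)
[Machine specs] 12C 64G
[Contact]
[Attachments]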

  • fe.log/

I1207 12:39:25.635740 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377174, task_count_in_queue=26552
I1207 12:39:25.635756 30671 tablet_manager.cpp:379] Start to drop tablet 124377182
I1207 12:39:25.635797 30671 tablet_manager.cpp:424] Succeed to drop tablet 124377182
I1207 12:39:25.636188 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377182, task_count_in_queue=26551
I1207 12:39:25.636202 30671 tablet_manager.cpp:379] Start to drop tablet 124377246
I1207 12:39:25.636242 30671 tablet_manager.cpp:424] Succeed to drop tablet 124377246
I1207 12:39:25.636592 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377246, task_count_in_queue=26550
I1207 12:39:25.636610 30671 tablet_manager.cpp:379] Start to drop tablet 124377258
I1207 12:39:25.636670 30671 tablet_manager.cpp:424] Succeed to drop tablet 124377258
I1207 12:39:25.654086 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377258, task_count_in_queue=26549
I1207 12:39:25.654143 30671 tablet_manager.cpp:379] Start to drop tablet 124377270
I1207 12:39:25.654224 30671 tablet_manager.cpp:424] Succeed to drop tablet 124377270
I1207 12:39:25.658403 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377270, task_count_in_queue=26548
I1207 12:39:25.658432 30671 tablet_manager.cpp:379] Start to drop tablet 124377278
I1207 12:39:25.658504 30671 tablet_manager.cpp:424] Succeed to drop tablet 124377278
I1207 12:39:25.665910 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124377278, task_count_in_queue=26547
I1207 12:39:25.665942 30671 tablet_manager.cpp:379] Start to drop tablet 124758529
I1207 12:39:25.666049 30671 tablet_manager.cpp:424] Succeed to drop tablet 124758529
I1207 12:39:25.672788 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124758529, task_count_in_queue=26546
I1207 12:39:25.672822 30671 tablet_manager.cpp:379] Start to drop tablet 124758537
I1207 12:39:25.672905 30671 tablet_manager.cpp:424] Succeed to drop tablet 124758537
I1207 12:39:25.677762 30671 agent_task.cpp:149] Remove task success. type=DROP, signature=124758537, task_count_in_queue=26545
I1207 12:39:25.853847 27077 rowset_merger.cpp:252] compaction merge finished. tablet=60993061 #key=2 algorithm=VERTICAL_COMPACTION column_group_size=5 input(entry=3 rows=26714 del=0 actual=26714 bytes=3.95 MB) output(rows=26714 chunk=8 bytes=3.97 MB) duration: 701ms
I1207 12:39:25.858350 27077 tablet_updates.cpp:1827] commit compaction tablet:60993061 version:32386.1 rowset:4698 #seg:1 #row:26714 size:3.97 MB #pending:0 state_memory:720.71 KB
I1207 12:39:25.876853 32740 tablet_updates.cpp:1856] apply_compaction_commit start tablet:60993061 version:32386.1 rowset:4698
I1207 12:39:25.995950 32740 tablet_updates.cpp:2035] apply_compaction_commit finish tablet:60993061 version:32386.1 total del/row:0/26714 0% rowset:4698 #row:26714 #del:0 #delvec:1 duration:119ms(0/5/114)
I1207 12:39:26.343256 27077 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=60993057 highest_score=801161304
I1207 12:39:26.343302 27077 tablet_updates.cpp:2361] update compaction start tablet:60993057 version:32386 score:801161344 pick:3/valid:3/all:3 4574,4575,4576 #segments:1 #rows:26823->26823 bytes:3.95 MB->3.95 MB(estimate)
I1207 12:39:26.804563 27077 rowset_merger.cpp:252] compaction merge finished. tablet=60993057 #key=2 algorithm=VERTICAL_COMPACTION column_group_size=5 input(entry=3 rows=26823 del=0 actual=26823 bytes=3.95 MB) output(rows=26823 chunk=7 bytes=3.98 MB) duration: 461ms
I1207 12:39:26.807950 27077 tablet_updates.cpp:1827] commit compaction tablet:60993057 version:32386.1 rowset:4577 #seg:1 #row:26823 size:3.98 MB #pending:0 state_memory:721.56 KB
I1207 12:39:26.808264 32742 tablet_updates.cpp:1856] apply_compaction_commit start tablet:60993057 version:32386.1 rowset:4577
I1207 12:39:26.887727 32742 tablet_updates.cpp:2035] apply_compaction_commit finish tablet:60993057 version:32386.1 total del/row:0/26823 0% rowset:4577 #row:26823 #del:0 #delvec:1 duration:80ms(0/5/75)
I1207 12:39:26.891744 26831 data_consumer.cpp:73] init kafka consumer with group id: load_crawler_jd_period_d9b53a0b-72f5-4e83-ab38-af60edd30d09
I1207 12:39:27.132047 27077 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=60993041 highest_score=801156967
I1207 12:39:27.132143 27077 tablet_updates.cpp:2361] update compaction start tablet:60993041 version:32386 score:801156992 pick:3/valid:3/all:3 4682,4683,4684 #segments:1 #rows:26715->26715 bytes:3.96 MB->3.96 MB(estimate)
I1207 12:39:27.375698 27077 rowset_merger.cpp:252] compaction merge finished. tablet=60993041 #key=2 algorithm=VERTICAL_COMPACTION column_group_size=5 input(entry=3 rows=26715 del=0 actual=26715 bytes=3.96 MB) output(rows=26715 chunk=7 bytes=3.96 MB) duration: 243ms
I1207 12:39:27.380985 27077 tablet_updates.cpp:1827] commit compaction tablet:60993041 version:32386.1 rowset:4685 #seg:1 #row:26715 size:3.96 MB #pending:0 state_memory:720.72 KB
I1207 12:39:27.381091 32742 tablet_updates.cpp:1856] apply_compaction_commit start tablet:60993041 version:32386.1 rowset:4685
I1207 12:39:27.458253 32742 tablet_updates.cpp:2035] apply_compaction_commit finish tablet:60993041 version:32386.1 total del/row:0/26715 0% rowset:4685 #row:26715 #del:0 #delvec:1 duration:77ms(0/4/73)
I1207 12:39:27.770390 27077 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=60993073 highest_score=801155922
I1207 12:39:27.770429 27077 tablet_updates.cpp:2361] update compaction start tablet:60993073 version:32386 score:801155904 pick:3/valid:3/all:3 6018,6019,6020 #segments:1 #rows:26822->26822 bytes:3.96 MB->3.96 MB(estimate)
I1207 12:39:27.787901 27104 local_tablets_channel.cpp:569] LocalTabletsChannel txn_id: 70119567 load_id: 6bd165b3-dcbd-42bb-b157-60064dd7dd2b open 1 delta writer: [60992922:1] 0 failed_tablets: _num_remaining_senders: 1
I1207 12:39:27.860742 26831 data_consumer.cpp:73] init kafka consumer with group id: loadthirdhub_tradesold_xhs_normal_fa2b9a71-a750-41a3-91c5-c430f4f802ce
I1207 12:39:27.941299 27104 local_tablets_channel.cpp:591] cancel LocalTabletsChannel txn_id: 70119567 load_id: 6bd165b3dcbd42bb-b15760064dd7dd2b index_id: 60992913 #tablet:1 tablet_ids:60992922
I1207 12:39:28.135890 27077 rowset_merger.cpp:252] compaction merge finished. tablet=60993073 #key=2 algorithm=VERTICAL_COMPACTION column_group_size=5 input(entry=3 rows=26822 del=0 actual=26822 bytes=3.96 MB) output(rows=26822 chunk=7 bytes=3.97 MB) duration: 365ms
I1207 12:39:28.139605 27077 tablet_updates.cpp:1827] commit compaction tablet:60993073 version:32386.1 rowset:6021 #seg:1 #row:26822 size:3.97 MB #pending:0 state_memory:721.55 KB
I1207 12:39:28.163888 32757 tablet_updates.cpp:1856] apply_compaction_commit start tablet:60993073 version:32386.1 rowset:6021
I1207 12:39:28.279156 32757 tablet_updates.cpp:2035] apply_compaction_commit finish tablet:60993073 version:32386.1 total del/row:0/26822 0% rowset:6021 #row:26822 #del:0 #delvec:1 duration:115ms(0/4/111)
I1207 12:39:28.586530 27077 tablet_manager.cpp:759] Found the best tablet to compact. compaction_type=update tablet_id=60993268 highest_score=753146257
I1207 12:39:28.586571 27077 tablet_updates.cpp:2361] update compaction start tablet:60993268 version:46544 score:753146240 pick:3/valid:3/all:3 2646,2647,2648 #segments:2 #rows:94448->94447 bytes:49.75 MB->49.75 MB(estimate)
I1207 12:39:28.796818 27055 local_tablets_channel.cpp:569] LocalTabletsChannel txn_id: 70119568 load_id: 22e5ac43-9fa3-4b3c-8a4b-a885e3d6c25b open 1 delta writer: [60992899:1] 0 failed_tablets: _num_remaining_senders: 1
I1207 12:39:28.818688 27056 local_tablets_channel.cpp:591] cancel LocalTabletsChannel txn_id: 70119568 load_id: 22e5ac439fa34b3c-8a4ba885e3d6c25b index_id: 60992830 #tablet:1 tablet_ids:60992899
I1207 12:39:28.826714 26831 data_consumer.cpp:73] init kafka consumer with group id: loadthirdhub_wph_purchase_sale_inventory_split_v3_fe1f2162-2324-415b-b749-e8a8b3fdf061
I1207 12:39:29.025642 26831 data_consumer.cpp:73] init kafka consumer with group id: load_crawler_jd_customer_3fe76e19-0432-4178-9d7a-3343b20a2e5d
I1207 12:39:29.763940 26541 daemon.cpp:211] Current memory statistics: process(31987513800), query_pool(21453487784), load(67697208), metadata(1542289390), compactioI1207 13:40:51.549101 4455 daemon.cpp:305] version 3.1.4-0c4b2a3

BE.OUT
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Tue Dec 5 17:05:23 CST 2023
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Tue Dec 5 23:23:01 CST 2023
start time: Tue Dec 5 23:56:34 CST 2023
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Thu Dec 7 10:57:25 CST 2023
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Thu Dec 7 13:40:51 CST 2023

Is it one specific BE, or are the BEs restarting in turn? Please post the be.out log.

One BE went down a little after 10 a.m. After we restarted it, it went down again a little after 12.

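Does dmesg -T | grep starrocks return anything? Is this a newly appearing problem? Were any changes made to the cluster before the failure? Please also post the output of show backends; connect to the FE leader node to run it.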

This BE is co-deployed with other services, right?

Yes, this machine is co-deployed. The FE on it is a follower.

With a mixed deployment, you have to leave memory on the BE host for the other applications. To do that, modify mem_limit = xxx under be/conf: https://docs.starrocks.io/zh/docs/administration/Configuration/#mem_limit

We have already changed it to 80% to reserve some memory. Is this error caused by running out of memory?

How much memory do you have in total?

64G in total, and the FE on this machine is only a follower.
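For reference, a rough sketch of how much memory the 80% setting leaves for everything else on this host. It assumes that "64" means 64 GiB of physical memory and that a percentage mem_limit is applied to total physical memory; the constant and function names are illustrative only.

```python
# Assumptions (not confirmed in the thread): 64 GiB of physical memory,
# and mem_limit = 80% interpreted as a share of total physical memory.
TOTAL_GIB = 64.0
MEM_LIMIT_FRACTION = 0.80


def be_budget_gib(total_gib: float, fraction: float) -> float:
    """Upper bound the BE process is allowed to reach under mem_limit."""
    return total_gib * fraction


def headroom_gib(total_gib: float, fraction: float) -> float:
    """Memory left for the FE, the OS and any other co-deployed application."""
    return total_gib - be_budget_gib(total_gib, fraction)


budget = be_budget_gib(TOTAL_GIB, MEM_LIMIT_FRACTION)
headroom = headroom_gib(TOTAL_GIB, MEM_LIMIT_FRACTION)
print(f"BE may grow to about {budget:.1f} GiB; about {headroom:.1f} GiB is left for everything else")
```

Under these assumptions, anything else running on the host has to fit into the remaining headroom, otherwise the kernel may start killing processes before the BE reaches its own limit.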

Pull out the dmesg output and I will check how much memory the other applications are using.

From dmesg, the BE was killed when it had used about 35G, so other processes are probably using too much.

dmesg.txt (195.7 KB)

Here the BE used 8789289*4/1024/1024 ≈ 33.5G, and this Java application used 19.5G. I suggest migrating that application to another machine.
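The arithmetic above can be reproduced with a short sketch. It assumes that 8789289 is the BE's resident set size counted in 4 KiB pages (as the kernel's OOM report usually prints it) and that the host has 64 GiB of physical memory; the names below are illustrative, and the 19.5G figure for the Java application is taken from the post as-is.

```python
# Assumptions: rss is reported in pages, the page size is 4 KiB,
# and the machine has 64 GiB of physical memory.
PAGE_SIZE_KIB = 4
BE_RSS_PAGES = 8789289      # figure quoted from dmesg in the thread
JAVA_APP_GIB = 19.5         # figure quoted in the thread
TOTAL_GIB = 64.0


def pages_to_gib(pages: int, page_kib: int = PAGE_SIZE_KIB) -> float:
    """Convert a page count to GiB: pages * KiB per page / 1024 / 1024."""
    return pages * page_kib / 1024 / 1024


be_gib = pages_to_gib(BE_RSS_PAGES)
combined_gib = be_gib + JAVA_APP_GIB
remaining_gib = TOTAL_GIB - combined_gib

print(f"BE resident memory:       {be_gib:.1f} GiB")
print(f"BE + Java application:    {combined_gib:.1f} GiB")
print(f"Left for FE, OS and rest: {remaining_gib:.1f} GiB")
```

With these two processes alone taking most of the machine, little is left for the FE follower and the operating system, which is consistent with the advice to move the Java application off this host.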

OK, thanks a lot.