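【Details】BE node goes down when memory usage exceeds mem_limit
【Background】Originally BE memory was not limited, but once the machine's memory was exhausted the BE process was killed by the system. We now use mem_limit to cap BE memory usage, but the BE node still goes down when usage exceeds mem_limit.
【Business impact】Disrupts normal business operations
【StarRocks version】2.3.8
【Cluster size】3 FE (3 followers) + 5 BE
【Contact】15623937986
【Attachment】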
I0302 14:06:17.068173 40987 daemon.cpp:184] Current memory statistics: process(50700279160), query_pool(102608), load(43529880768), tablet_meta(838856192), compaction(0), schema_change(0), column_pool(3382830880), page_cache(0), update(13851223527), chunk_allocator(2949488), clone(0), consistency(0)
I0302 14:06:18.724346 42492 internal_service.cpp:191] exec plan fragment, fragment_instance_id=5924f81a-b8c0-11ed-b229-d4f5ef301411, coord=TNetworkAddress(hostname=10.22.13.235, port=19020), backend=0, is_pipeline=0, chunk_size=4096
W0302 14:06:18.724455 42492 internal_service.cpp:152] exec plan fragment failed, errmsg=Memory of process exceed limit. Start execute plan fragment. Used: 50969849960, Limit: 48573096836. Mem usage has exceed the limit of BE
I0302 14:06:18.725368 42452 internal_service.cpp:235] cancel fragment, fragment_instance_id=5924f81a-b8c0-11ed-b229-d4f5ef301411, reason: InternalError
I0302 14:06:18.787642 42412 data_dir.cpp:531] path: /data/data1/starrocks1 total capacity: 3998831599616, available capacity: 1049429594112
I0302 14:06:18.787679 42412 data_dir.cpp:531] path: /data/data10/starrocks10 total capacity: 3998831599616, available capacity: 1102276276224
I0302 14:06:18.787693 42412 data_dir.cpp:531] path: /data/data11/starrocks11 total capacity: 3998831599616, available capacity: 1757440593920
I0302 14:06:18.787706 42412 data_dir.cpp:531] path: /data/data12/starrocks12 total capacity: 3998831599616, available capacity: 1627159867392
I0302 14:06:18.787753 42412 data_dir.cpp:531] path: /data/data2/starrocks2 total capacity: 3998831599616, available capacity: 1112410374144
I0302 14:06:18.787768 42412 data_dir.cpp:531] path: /data/data3/starrocks3 total capacity: 3998831599616, available capacity: 1017530736640
I0302 14:06:18.787779 42412 data_dir.cpp:531] path: /data/data4/starrocks4 total capacity: 3998831599616, available capacity: 1736039952384
I0302 14:06:18.787792 42412 data_dir.cpp:531] path: /data/data5/starrocks5 total capacity: 3998831599616, available capacity: 449360437248
I0302 14:06:18.787802 42412 data_dir.cpp:531] path: /data/data6/starrocks6 total capacity: 3998831599616, available capacity: 1163107180544
I0302 14:06:18.787814 42412 data_dir.cpp:531] path: /data/data7/starrocks7 total capacity: 3998831599616, available capacity: 1057337352192
I0302 14:06:18.787828 42412 data_dir.cpp:531] path: /data/data8/starrocks8 total capacity: 3998831599616, available capacity: 424568508416
I0302 14:06:18.787838 42412 data_dir.cpp:531] path: /data/data9/starrocks9 total capacity: 3998831599616, available capacity: 1073745293312
I0302 14:06:19.067405 44836 primary_index.cpp:78] primary_index large alloc 859832298=687865832
I0302 14:06:19.833743 42926 task_worker_pool.cpp:266] success to submit task. type=CLONE, signature=[27061679,27057322,27048856,24355472], task_count_in_queue=4
I0302 14:06:19.833773 42407 task_worker_pool.cpp:1103] get clone task. signature:27061679
I0302 14:06:19.833796 42407 engine_storage_migration_task.cpp:49] Already existed path. tablet_id=27061679, dest_store=/data/data4/starrocks4
I0302 14:06:19.833799 42407 task_worker_pool.cpp:1129] storage migrate success. status:OK, signature:27061679
I0302 14:06:19.833804 42407 tablet_manager.cpp:860] Reporting tablet info. tablet_id=27061679
I0302 14:06:19.833843 42406 task_worker_pool.cpp:1103] get clone task. signature:27057322
I0302 14:06:19.833863 42406 engine_storage_migration_task.cpp:49] Already existed path. tablet_id=27057322, dest_store=/data/data11/starrocks11
I0302 14:06:19.833866 42406 task_worker_pool.cpp:1129] storage migrate success. status:OK, signature:27057322
I0302 14:06:19.833870 42406 tablet_manager.cpp:860] Reporting tablet info. tablet_id=27057322
I0302 14:06:19.833868 42408 task_worker_pool.cpp:1103] get clone task. signature:27048856
I0302 14:06:19.833886 42408 engine_storage_migration_task.cpp:49] Already existed path. tablet_id=27048856, dest_store=/data/data4/starrocks4
I0302 14:06:19.833892 42408 task_worker_pool.cpp:1129] storage migrate success. status:OK, signature:27048856
I0302 14:06:19.833897 42408 tablet_manager.cpp:860] Reporting tablet info. tablet_id=27048856
I0302 14:06:19.843379 42408 task_worker_pool.cpp:1103] get clone task. signature:24355472
I0302 14:06:19.843395 42408 engine_storage_migration_task.cpp:49] Already existed path. tablet_id=24355472, dest_store=/data/data12/starrocks12
I0302 14:06:19.843400 42408 task_worker_pool.cpp:1129] storage migrate success. status:OK, signature:24355472
I0302 14:06:19.843402 42408 tablet_manager.cpp:860] Reporting tablet info. tablet_id=24355472
I0302 14:06:19.859230 41703 primary_index.cpp:78] primary_index large alloc 859832298=687865832
I0302 14:06:19.931133 42483 socket.cpp:2201] Checking Socket{id=386 addr=10.22.13.240:18060} (0xd5480400)
I0302 14:06:20.855073 42587 stream_load.cpp:208] new income streaming load request.id=12475e408ba2796d-1c244d345f26c9b6, job_id=-1, txn_id: -1, label=641bfe56-37ef-4834-bf2a-8073cf01e86c, db=mir_awr_dw, db=mir_awr_dw, tbl=mir_vs_hist_active_sess_history
W0302 14:06:20.861460 42587 stream_load.cpp:509] plan streaming load failed. errmsg=Tablet lost replicas. Check if any backend is down or not. tablet_id: 9478811, backends: 10.22.13.236,10.22.13.238,10.22.13.239id=12475e408ba2796d-1c244d345f26c9b6, job_id=-1, txn_id: 12041389, label=641bfe56-37ef-4834-bf2a-8073cf01e86c, db=mir_awr_dw
I0302 14:06:21.571352 40986 daemon.cpp:88] Released 8388608 bytes from column pool
I0302 14:06:23.111634 42535 internal_service.cpp:191] exec plan fragment, fragment_instance_id=5bc2600d-b8c0-11ed-b084-d4f5ef304bc2, coord=TNetworkAddress(hostname=10.22.13.233, port=19020), backend=0, is_pipeline=1, chunk_size=4096
W0302 14:06:23.111768 42535 internal_service.cpp:152] exec plan fragment failed, errmsg=Memory of process exceed limit. Start execute plan fragment. Used: 52171477688, Limit: 48573096836. Mem usage has exceed the limit of BE
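To make the log above easier to read, here is a minimal Python sketch that pulls the per-tracker byte counts out of the `Current memory statistics` line and the `Used`/`Limit` pair out of the `exceed limit` warning. The function names are hypothetical helpers, and the regular expressions only assume the line layout visible in this log. In the sample line, `load` reports about 43.5 GB against a process total of about 50.7 GB; note that `load` plus `update` already exceeds the process total, so the trackers do not appear to be disjoint and should not simply be summed.

```python
import re

GIB = 1024 ** 3


def parse_memory_statistics(line):
    """Return {tracker_name: bytes} from a 'Current memory statistics' log line."""
    _, _, tail = line.partition("Current memory statistics:")
    return {name: int(value) for name, value in re.findall(r"(\w+)\((\d+)\)", tail)}


def parse_limit_error(line):
    """Return (used_bytes, limit_bytes) from an 'exceed limit' warning, or None."""
    match = re.search(r"Used: (\d+), Limit: (\d+)", line)
    return (int(match.group(1)), int(match.group(2))) if match else None


# Both sample lines are copied from the log in this post (prefix shortened).
stats_line = (
    "daemon.cpp:184] Current memory statistics: process(50700279160), "
    "query_pool(102608), load(43529880768), tablet_meta(838856192), "
    "compaction(0), schema_change(0), column_pool(3382830880), page_cache(0), "
    "update(13851223527), chunk_allocator(2949488), clone(0), consistency(0)"
)
error_line = (
    "exec plan fragment failed, errmsg=Memory of process exceed limit. "
    "Start execute plan fragment. Used: 50969849960, Limit: 48573096836."
)

stats = parse_memory_statistics(stats_line)
used, limit = parse_limit_error(error_line)

# Print the trackers from largest to smallest, in GiB.
for name, value in sorted(stats.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:16s} {value / GIB:8.2f} GiB")
print(f"over limit by    {(used - limit) / GIB:8.2f} GiB")
```

Running this against successive log lines shows which tracker is growing before the process crosses the limit.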
Are FE and BE deployed on the same machines?
In this situation you can try adding the following settings:
net.ipv4.tcp_abort_on_overflow=1
net.core.somaxconn=10240
vm.overcommit_memory=1
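To confirm that these three settings are actually in effect on each BE host, a small check along the following lines may help. This is only a sketch: `read_sysctl` and `find_mismatches` are hypothetical helper names, and it assumes a Linux host where each key maps to a file under `/proc/sys` (dots replaced by path separators).

```python
from pathlib import Path

# Values suggested in the reply above.
RECOMMENDED = {
    "net.ipv4.tcp_abort_on_overflow": "1",
    "net.core.somaxconn": "10240",
    "vm.overcommit_memory": "1",
}


def read_sysctl(key, root="/proc/sys"):
    """Read a kernel parameter from /proc/sys; return None if it cannot be read."""
    path = Path(root, *key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def find_mismatches(recommended, read=read_sysctl):
    """Return {key: (current, recommended)} for every key whose value differs."""
    mismatches = {}
    for key, want in recommended.items():
        have = read(key)
        if have != want:
            mismatches[key] = (have, want)
    return mismatches


if __name__ == "__main__":
    for key, (have, want) in find_mismatches(RECOMMENDED).items():
        print(f"{key}: current={have} recommended={want}")
```

The script prints nothing when all three values already match. It only reads the current values and does not change them.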
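No, they are not co-located. I am testing in a test environment. I had assumed mem_limit would cap BE memory and keep the BE node from going down. As it stands, it seems to make little difference whether this parameter is set.
Please post be.out as well.
How much memory are you currently testing with?
/proc/sys/vm/overcommit_memory must be set to 1.
Tried that. I set it, and the problem is still the same.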
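Has this problem been resolved? I have run into it too.
Please post the be.out log. Which version are you on?
Hit the same problem. StarRocks version: 2.3.7
Our problem still has not been solved. Have you solved yours?
Have you tried adjusting load_mem_limit in be.conf?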
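We are on the same version with the same problem. We have also found the current versions harder to work with than earlier ones. We started on 1.18; after upgrading to 2.2.2 we had constant pipeline problems and slower queries, and after upgrading to 2.3.8 we keep hitting the same problem as the original poster, on the same machines with the same configuration. I honestly don't understand it. With releases coming out this fast, have they been adequately tested? Compared with Hadoop clusters and ClickHouse, availability and stability are somewhat worse. To be blunt, Hive's source code is a pleasure to read, whereas the member variables of the FE class GlobalStateMgr in SR give me a headache; just count how many there are. Are you really building the fastest analytics engine in the world?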
be.WARNING keeps reporting the following error. What is going on? Could you help take a look?
local tablet migration failed. status: Already exist: tablet_meta already exist. tablet: 9125678.132572125.a4485de49748f1a0-7bef62a25d64e9bd, signature: 9125678