BE memory surges until the process crashes

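To help us locate your problem faster, please provide the following information. Thank you.
【Details】Detailed description of the problem
【Background】Which operations were performed?
【Business impact】
【Shared-data (storage-compute separated) deployment?】
【StarRocks version】e.g. 3.2.2
【Cluster size】e.g. 4 FE (2 follower + 1 observer) + 6 BE (FE and BE co-located)
【Machine info】CPU virtual cores / memory / NIC, e.g. 48C/64G/10GbE
【Contact】So that we can reach you promptly for logs while working on the problem, please add your contact details, e.g. "Community group 4 - Xiao Li" or an email address. Thank you.
【Attachments】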

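  • fe.log / be.INFO / relevant screenshots
  • Slow query:
    • Profile information
    • Parallelism: show variables like '%parallel_fragment_exec_instance_num%';
    • Whether pipeline is enabled: show variables like '%pipeline%';
    • Screenshots of CPU and memory usage on the BE nodes
  • Query error:
  • BE crash
    • be.out
  • External table query error
    • be.out and fe.warn.log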
      Memory of process exceed limit. Start execute plan fragment. Used: 222847457536, Limit: 206486153809. Mem usage has exceed the limit of BE backend [id=10005]
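
To put the figures in that error message in perspective, here is a minimal Python sketch that extracts the `Used` and `Limit` byte counts and converts them to GiB. The message text is copied from the report; the helper name `parse_usage` is made up for illustration. By this conversion the process was using roughly 207.5 GiB against a limit of roughly 192.3 GiB.

```python
import re

# Error text copied from the report above.
MESSAGE = ("Memory of process exceed limit. Start execute plan fragment. "
           "Used: 222847457536, Limit: 206486153809. "
           "Mem usage has exceed the limit of BE backend [id=10005]")

GIB = 1024 ** 3  # bytes per GiB


def parse_usage(message):
    """Extract (used_bytes, limit_bytes) from the BE error text."""
    m = re.search(r"Used: (\d+), Limit: (\d+)", message)
    if m is None:
        raise ValueError("no Used/Limit pair found")
    return int(m.group(1)), int(m.group(2))


used, limit = parse_usage(MESSAGE)
overrun = used - limit
print(f"used  = {used / GIB:.1f} GiB")
print(f"limit = {limit / GIB:.1f} GiB")
print(f"over  = {overrun / GIB:.1f} GiB ({used / limit - 1:.1%} above the limit)")
```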


The update_apply threads show high CPU usage.

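Is the persistent primary key index not enabled? Almost all of the memory is taken up by the primary key index.
Best practices for primary key tables:
1. Create partitioned tables.
2. The data should have a clear hot/cold pattern, with updates mostly hitting the most recent 7 days or 1 month of data.
3. Enable the persistent primary key index; the disks should preferably be SSDs.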


Thanks!
I have changed all tables with ALTER TABLE *** SET ("enable_persistent_index" = "true"); but the symptom is unchanged.
Is there any other switch I need to turn on?
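
For a cluster with many primary key tables, the ALTER statement quoted above can be generated per table rather than typed by hand. This is only a sketch: the table names are hypothetical, and the statements are printed so they can be reviewed before being run in a SQL client.

```python
def persistent_index_statements(tables):
    """Build one ALTER TABLE statement per table name."""
    template = 'ALTER TABLE {name} SET ("enable_persistent_index" = "true");'
    return [template.format(name=t) for t in tables]


# Hypothetical table names, for illustration only.
for stmt in persistent_index_statements(["db1.orders", "db1.events"]):
    print(stmt)
```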

curl -XGET -s http://be_ip:8040/metrics | grep -E "^starrocks_be_.*_mem_bytes|^starrocks_be_tcmalloc_bytes_in_use"

Please share the output of this command.
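
If the metrics output is long, the same filter can be applied in Python and the matching lines sorted by value so the largest memory consumer appears first. A minimal sketch: the regular expression is the one from the grep command, while the sample text, its metric names, and its values are made up for illustration and are not output from this cluster.

```python
import re

# Same pattern as the grep command above.
PATTERN = re.compile(r"^starrocks_be_.*_mem_bytes|^starrocks_be_tcmalloc_bytes_in_use")

# Made-up sample in Prometheus text format; real output has many more lines.
SAMPLE = """\
# TYPE starrocks_be_update_mem_bytes gauge
starrocks_be_update_mem_bytes 3400000000
starrocks_be_query_mem_bytes 120000000
starrocks_be_tcmalloc_bytes_in_use 3900000000
starrocks_be_some_other_counter 42
"""


def memory_metrics(text):
    """Return (metric, value) pairs matching PATTERN, largest value first."""
    rows = []
    for line in text.splitlines():
        if PATTERN.search(line):
            # The value is the last whitespace-separated field on the line.
            name, value = line.rsplit(" ", 1)
            rows.append((name, float(value)))
    return sorted(rows, key=lambda r: r[1], reverse=True)


for name, value in memory_metrics(SAMPLE):
    print(f"{value / 1024 ** 3:8.2f} GiB  {name}")
```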

[Screenshot of the command output]


At this point it is update_apply that is consuming the memory.

Follow https://starrocks.feishu.cn/docx/U67gdxuHgozWVPxzekLcLs6Onid to capture a heap profile and take a look.

digraph "/starrocks/be/lib/starrocks_be; 3254.4 MB" {
node [width=0.375,height=0.25];
Legend [shape=box,fontsize=24,shape=plaintext,label="/starrocks/be/lib/starrocks_be\lTotal MB: 3254.4\lFocusing on: 3254.4\lDropped nodes with <= 16.3 abs(MB)\lDropped edges with <= 3.3 MB\l"];
N1 [label="__clone\n0.0 (0.0%)\rof 3252.9 (100.0%)\r",shape=box,fontsize=8.0];
N2 [label="start_thread\n0.0 (0.0%)\rof 3252.9 (100.0%)\r",shape=box,fontsize=8.0];
N3 [label="starrocks\nPrimaryIndex\n_do_load\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N4 [label="starrocks\nPrimaryIndex\nload\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N5 [label="starrocks\nTabletUpdates\n_apply_compaction_commit\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N6 [label="starrocks\nTabletUpdates\ndo_apply\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N7 [label="starrocks\nThread\nsupervise_thread\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N8 [label="starrocks\nThreadPool\ndispatch_thread\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N9 [label="starrocks\nPrimaryIndex\ninsert\n0.0 (0.0%)\rof 3124.0 (96.0%)\r",shape=box,fontsize=8.0];
N10 [label="starrocks\nShardByLengthSliceHashIndex\ninsert\n0.0 (0.0%)\rof 3124.0 (96.0%)\r",shape=box,fontsize=8.0];
N11 [label="starrocks\nSliceHashIndex\ninsert\n0.0 (0.0%)\rof 3124.0 (96.0%)\r",shape=box,fontsize=8.0];
N12 [label="phmap\npriv\nraw_hash_set\nprepare_insert\n0.0 (0.0%)\rof 3072.0 (94.4%)\r",shape=box,fontsize=8.0];
N13 [label="phmap\npriv\nraw_hash_set\nresize\n3072.0 (94.4%)\r",shape=box,fontsize=56.6];
N14 [label="execute_native_thread_routine\n0.0 (0.0%)\rof 127.5 (3.9%)\r",shape=box,fontsize=8.0];
N15 [label="starrocks\nReportOlapTableTaskWorkerPool\n_worker_thread_callback\n0.0 (0.0%)\rof 80.5 (2.5%)\r",shape=box,fontsize=8.0];
N16 [label="starrocks\nTabletManager\nreport_all_tablets_info\n0.0 (0.0%)\rof 80.5 (2.5%)\r",shape=box,fontsize=8.0];
N17 [label="std\n_Rb_tree\n_M_emplace_unique\n18.5 (0.6%)\rof 80.5 (2.5%)\r",shape=box,fontsize=11.8];
N18 [label="starrocks\nTTablet\nTTablet\n0.0 (0.0%)\rof 62.0 (1.9%)\r",shape=box,fontsize=8.0];
N19 [label="std\nvector\noperator=\n[clone\n.isra.0]\n62.0 (1.9%)\r",shape=box,fontsize=14.9];
N20 [label="std\n__cxx11\nbasic_string\n_M_construct\n[clone\n.constprop.0]\n52.0 (1.6%)\r",shape=box,fontsize=14.3];
N21 [label="apache\nthrift\nTDispatchProcessor\nprocess\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N22 [label="apache\nthrift\nconcurrency\nThread\nthreadMain\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N23 [label="apache\nthrift\nserver\nTConnectedClient\nrun\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N24 [label="apache\nthrift\nserver\nTThreadedServer\nTConnectedClientRunner\nrun\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N25 [label="starrocks\nBackendServiceProcessor\ndispatchCall\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N26 [label="starrocks\nBackendServiceProcessor\nprocess_get_tablet_stat\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N27 [label="starrocks\nTabletManager\n_build_tablet_stat\n47.0 (1.4%)\r",shape=box,fontsize=14.0];
N28 [label="starrocks\nTabletManager\nget_tablet_stat\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N29 [label="std\nthread\n_State_impl\n_M_run\n0.0 (0.0%)\rof 47.0 (1.4%)\r",shape=box,fontsize=8.0];
N1 -> N2 [label=3252.9, weight=100000, style="setlinewidth(2.000000)"];
N6 -> N5 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N2 -> N7 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N5 -> N4 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N4 -> N3 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N7 -> N8 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N8 -> N6 [label=3125.4, weight=100000, style="setlinewidth(2.000000)"];
N10 -> N11 [label=3124.0, weight=100000, style="setlinewidth(2.000000)"];
N9 -> N10 [label=3124.0, weight=100000, style="setlinewidth(2.000000)"];
N3 -> N9 [label=3124.0, weight=100000, style="setlinewidth(2.000000)"];
N12 -> N13 [label=3072.0, weight=100000, style="setlinewidth(2.000000)"];
N11 -> N12 [label=3072.0, weight=100000, style="setlinewidth(2.000000)"];
N2 -> N14 [label=127.5, weight=100000, style="setlinewidth(0.235099)"];
N15 -> N16 [label=80.5, weight=100000, style="setlinewidth(0.148439)"];
N16 -> N17 [label=80.5, weight=100000, style="setlinewidth(0.148439)"];
N14 -> N15 [label=80.5, weight=100000, style="setlinewidth(0.148439)"];
N18 -> N19 [label=62.0, weight=100000, style="setlinewidth(0.114328)"];
N17 -> N18 [label=62.0, weight=100000, style="setlinewidth(0.114328)"];
N11 -> N20 [label=52.0, weight=100000, style="setlinewidth(0.095875)"];
N26 -> N28 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N25 -> N26 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N21 -> N25 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N24 -> N23 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N22 -> N24 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N29 -> N22 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N28 -> N27 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N23 -> N21 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
N14 -> N29 [label=47.0, weight=100000, style="setlinewidth(0.086660)"];
}
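
A dump like this is easier to read once the node with the largest self allocation is singled out. The sketch below is one way to do that in Python. It assumes each node label ends with the self size in MB, as in the lines above, and that joining the label's frame parts with `::` gives back a readable function name. The function name `parse_nodes` is made up, and the sample contains only a few lines copied from the dump.

```python
import re

# Matches node lines such as: N13 [label="...",shape=box,...];
NODE_RE = re.compile(r'^(N\d+) \[label="(.*?)",shape=box')

# A few lines copied from the dump; "\n" and "\r" are literal two-character escapes.
SAMPLE_DOT = r'''
N3 [label="starrocks\nPrimaryIndex\n_do_load\n0.0 (0.0%)\rof 3125.4 (96.0%)\r",shape=box,fontsize=8.0];
N13 [label="phmap\npriv\nraw_hash_set\nresize\n3072.0 (94.4%)\r",shape=box,fontsize=56.6];
N27 [label="starrocks\nTabletManager\n_build_tablet_stat\n47.0 (1.4%)\r",shape=box,fontsize=14.0];
N12 -> N13 [label=3072.0, weight=100000, style="setlinewidth(2.000000)"];
'''


def parse_nodes(dot_text):
    """Return (function_name, self_mb) for every node line; edge lines are skipped."""
    nodes = []
    for line in dot_text.splitlines():
        m = NODE_RE.match(line.strip())
        if not m:
            continue
        parts = m.group(2).split(r"\n")       # name parts, then the size field
        name = "::".join(parts[:-1])
        self_mb = float(parts[-1].split(" ")[0])  # the first number is the self size
        nodes.append((name, self_mb))
    return nodes


top_name, top_mb = max(parse_nodes(SAMPLE_DOT), key=lambda n: n[1])
print(f"{top_mb} MB allocated directly in {top_name}")
```

In this dump the hash-set resize under PrimaryIndex load accounts for 3072.0 MB of the 3254.4 MB total, which matches the diagnosis that primary index loading is what consumes the memory.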

It was caused by dropping a large table. The problem is resolved.

How was the table dropped?

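It was dropped with DROP TABLE, without FORCE. After a manual cleanup, the cluster no longer cycles through these log messages:
primary index released
load large primary index start
load primary index finish
One way to do the cleanup:

  • Enable vlog, find the problematic tablet, and delete that tablet with meta_tool.sh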