Cluster scan bytes stay high after upgrading from 2.1.12 to 2.3.7

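【Background】Upgraded directly from 2.1.12 to 2.3.7
【Business impact】Large jobs fail to complete
【StarRocks version】2.3.7, upgraded directly from 2.1.12
【Cluster size】3 FE + 12 BE
【Machine info】16 cores, 128 GB memory, 10G NIC, 16 TB SSD
【Contact】WeChat ID: zzDuke1688
【Workload】Most tables in the cluster are large (>1 TB) unique key/primary key tables; most data is loaded via stream load
【Attachments】
As shown in the screenshots: since the upgrade and node restart, BE load has stayed very high. I don't know the cause, or whether this is normal: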




be.WARNING contains a large number of log entries like the following:


W0111 07:32:31.503943 2455548 fragment_context.cpp:19] [Driver] Canceled, query_id=1ba976c0-9182-11ed-afbc-0242f4bd4a45, instance_id=1ba976c0-9182-11ed-afbc-0242f4bd4a51, reason=Cancelled: LimitReach
W0111 07:32:33.479714 2455553 fragment_context.cpp:19] [Driver] Canceled, query_id=1abfd984-9182-11ed-a7d5-02369d384f1f, instance_id=1abfd984-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:32:48.433082 2455526 fragment_context.cpp:19] [Driver] Canceled, query_id=22fa0546-9182-11ed-a7d5-02369d384f1f, instance_id=22fa0546-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:02.520313 2455538 fragment_context.cpp:19] [Driver] Canceled, query_id=2a98db68-9182-11ed-a7d5-02369d384f1f, instance_id=2a98db68-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:15.870564 2455547 fragment_context.cpp:19] [Driver] Canceled, query_id=334f150b-9182-11ed-a7d5-02369d384f1f, instance_id=334f150b-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:43.037076 2455534 fragment_context.cpp:19] [Driver] Canceled, query_id=4655892e-9182-11ed-afbc-0242f4bd4a45, instance_id=4655892e-9182-11ed-afbc-0242f4bd4a51, reason=Cancelled: LimitReach
W0111 07:33:43.257545 2455532 fragment_context.cpp:19] [Driver] Canceled, query_id=437ffaec-9182-11ed-a7d5-02369d384f1f, instance_id=437ffaec-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:10.653262 2455544 fragment_context.cpp:19] [Driver] Canceled, query_id=547c2013-9182-11ed-a7d5-02369d384f1f, instance_id=547c2013-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:24.168982 2455525 fragment_context.cpp:19] [Driver] Canceled, query_id=5b896494-9182-11ed-a7d5-02369d384f1f, instance_id=5b896494-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:37.188655 2455532 fragment_context.cpp:19] [Driver] Canceled, query_id=63c4eeeb-9182-11ed-a7d5-02369d384f1f, instance_id=63c4eeeb-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:51.820116 2455527 fragment_context.cpp:19] [Driver] Canceled, query_id=6c5689dc-9182-11ed-a7d5-02369d384f1f, instance_id=6c5689dc-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:06.186728 2455533 fragment_context.cpp:19] [Driver] Canceled, query_id=74ca1523-9182-11ed-a7d5-02369d384f1f, instance_id=74ca1523-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:19.633118 2455526 fragment_context.cpp:19] [Driver] Canceled, query_id=7d083884-9182-11ed-a7d5-02369d384f1f, instance_id=7d083884-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:34.016834 2455534 fragment_context.cpp:19] [Driver] Canceled, query_id=887d6a75-9182-11ed-aaff-024216bb15cf, instance_id=887d6a75-9182-11ed-aaff-024216bb15db, reason=Cancelled: LimitReach
W0111 07:38:46.861287 2455542 fragment_context.cpp:19] [Driver] Canceled, query_id=fac20abe-9182-11ed-aaff-024216bb15cf, instance_id=fac20abe-9182-11ed-aaff-024216bb15dc, reason=Cancelled: UserCancel
W0111 07:38:46.861518 2455550 fragment_context.cpp:19] [Driver] Canceled, query_id=fac20abe-9182-11ed-aaff-024216bb15cf, instance_id=fac20abe-9182-11ed-aaff-024216bb15db, reason=Cancelled: UserCancel
W0111 07:43:59.635890 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97243182 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97243182 #version:195 [81240 81422@194 81422] #pending:0 request_version:81423,
W0111 07:44:00.636166 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97080722 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97080722 #version:111 [181794 181892@110 181892] #pending:0 request_version:181893,
W0111 07:44:01.636880 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97314864 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97314864 #version:195 [91941 92123@194 92123] #pending:0 request_version:92124,
W0111 07:44:02.636570 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97243166 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97243166 #version:195 [81240 81422@194 81422] #pending:0 request_version:81423,
W0111 07:44:02.636588 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97080658 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97080658 #version:111 [181794 181892@110 181892] #pending:0 request_version:181893,
W0111 07:44:03.636886 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97318868 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97318868 #version:195 [91942 92124@194 92124] #pending:0 request_version:92125,
W0111 07:46:40.203774 2455547 fragment_context.cpp:19] [Driver] Canceled, query_id=1591a3cb-9184-11ed-aaff-024216bb15cf, instance_id=1591a3cb-9184-11ed-aaff-024216bb15db, reason=Cancelled: LimitReach
W0111 07:47:47.675930 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114237 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114237 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344,
W0111 07:47:48.676335 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114221 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114221 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344,
W0111 07:47:49.676908 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114205 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114205 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344
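
With this many warnings, a quick tally by type makes the log easier to discuss. Below is a minimal Python sketch that counts the cancel reasons and collects the tablet IDs from the make_snapshot failures. The regular expressions are assumptions inferred only from the lines quoted above, and `summarize_warnings` is a hypothetical helper, not part of StarRocks.

```python
import re
from collections import Counter

# Patterns inferred from the quoted be.WARNING lines (an assumption;
# other BE versions may format these messages differently).
CANCEL_RE = re.compile(r"reason=Cancelled: (\w+)")
SNAPSHOT_RE = re.compile(r"fail to make_snapshot\. tablet_id:(\d+)")


def summarize_warnings(lines):
    """Return (Counter of cancel reasons, set of tablet IDs with snapshot failures)."""
    reasons = Counter()
    tablets = set()
    for line in lines:
        cancel = CANCEL_RE.search(line)
        if cancel:
            reasons[cancel.group(1)] += 1
            continue
        snapshot = SNAPSHOT_RE.search(line)
        if snapshot:
            tablets.add(snapshot.group(1))
    return reasons, tablets


# A few lines copied from the log above.
sample = [
    "W0111 07:46:40.203774 2455547 fragment_context.cpp:19] [Driver] Canceled, query_id=1591a3cb-9184-11ed-aaff-024216bb15cf, instance_id=1591a3cb-9184-11ed-aaff-024216bb15db, reason=Cancelled: LimitReach",
    "W0111 07:38:46.861518 2455550 fragment_context.cpp:19] [Driver] Canceled, query_id=fac20abe-9182-11ed-aaff-024216bb15cf, instance_id=fac20abe-9182-11ed-aaff-024216bb15db, reason=Cancelled: UserCancel",
    "W0111 07:43:59.635890 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97243182 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97243182 #version:195 [81240 81422@194 81422] #pending:0 request_version:81423,",
    "W0111 07:44:00.636166 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97080722 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97080722 #version:111 [181794 181892@110 181892] #pending:0 request_version:181893,",
]

reasons, tablets = summarize_warnings(sample)
print(dict(reasons))
print(sorted(tablets))
```

To run it against the real file, pass an open file handle instead of `sample`, for example `summarize_warnings(open(path))` with `path` pointing at the BE warning log.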

be.info:

I0111 08:24:01.114723 2455620 txn_manager.cpp:204] Commit txn successfully.  tablet: 101969445, txn_id: 232245965, rowsetid: 02000000006d7b659f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.116550 2455062 txn_manager.cpp:204] Commit txn successfully.  tablet: 101969397, txn_id: 232245965, rowsetid: 02000000006d7b629f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.116703 2455067 txn_manager.cpp:204] Commit txn successfully.  tablet: 101969429, txn_id: 232245965, rowsetid: 02000000006d7b649f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.135730 2455612 txn_manager.cpp:204] Commit txn successfully.  tablet: 101969381, txn_id: 232245965, rowsetid: 02000000006d7b619f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.202814 2462606 task_worker_pool.cpp:202] Submit task success. type=PUBLISH_VERSION, signature=232245965, task_count_in_queue=1
I0111 08:24:01.202824 2455464 task_worker_pool.cpp:869] get publish version task, signature:232245965 txn_id: 232245965 priority queue size: 1
I0111 08:24:01.202912 2455468 tablet_updates.cpp:479] commit rowset tablet:101969381 version:98 txn_id: 232245965 02000000006d7b619f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11366 size:2.95 MB #pending:0
I0111 08:24:01.202930 2455468 engine_publish_version_task.cpp:60] Publish txn success tablet:101969381 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b619f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.202987 2455476 tablet_updates.cpp:479] commit rowset tablet:101969397 version:98 txn_id: 232245965 02000000006d7b629f4d26ff72ba53e910c3c3574458819b rowset:104 #seg:1 #delfile:0 #row:11085 size:2.78 MB #pending:0
I0111 08:24:01.203002 2455476 engine_publish_version_task.cpp:60] Publish txn success tablet:101969397 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b629f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203037 2455480 tablet_updates.cpp:479] commit rowset tablet:101969413 version:98 txn_id: 232245965 02000000006d7b639f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11185 size:2.83 MB #pending:0
I0111 08:24:01.203085 2455480 engine_publish_version_task.cpp:60] Publish txn success tablet:101969413 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b639f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203119 2455474 tablet_updates.cpp:479] commit rowset tablet:101969429 version:98 txn_id: 232245965 02000000006d7b649f4d26ff72ba53e910c3c3574458819b rowset:104 #seg:1 #delfile:0 #row:11302 size:2.91 MB #pending:0
I0111 08:24:01.203189 2455474 engine_publish_version_task.cpp:60] Publish txn success tablet:101969429 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b649f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203225 2455472 tablet_updates.cpp:479] commit rowset tablet:101969445 version:98 txn_id: 232245965 02000000006d7b659f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11536 size:2.96 MB #pending:0
I0111 08:24:01.203290 2455472 engine_publish_version_task.cpp:60] Publish txn success tablet:101969445 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b659f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203301 2455464 task_worker_pool.cpp:797] Publish version on partition. partition: 101969376, txn_id: 232245965, version: 98
I0111 08:24:01.203310 2455464 task_worker_pool.cpp:898] publish_version success. signature:232245965 txn_id: 232245965 related tablet num: 5 time: 0ms
I0111 08:24:01.208277 2488900 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969429 version:98 txn_id: 232245965 total del/row:0/801684 0% rowset:104 #seg:1 #op(upsert:11302 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208402 2488901 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969445 version:98 txn_id: 232245965 total del/row:0/798896 0% rowset:105 #seg:1 #op(upsert:11536 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208578 2488894 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969381 version:98 txn_id: 232245965 total del/row:0/801906 0% rowset:105 #seg:1 #op(upsert:11366 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208597 2488899 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969413 version:98 txn_id: 232245965 total del/row:0/801266 0% rowset:105 #seg:1 #op(upsert:11185 del:0) #del:0+0=0 #dv:1 duration:6ms(0/0/5/0/1)
I0111 08:24:01.208832 2488895 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969397 version:98 txn_id: 232245965 total del/row:0/800772 0% rowset:104 #seg:1 #op(upsert:11085 del:0) #del:0+0=0 #dv:1 duration:6ms(0/0/6/0/0)

Strangely, monitoring shows BE scan bytes are very high, while BE scan rows have barely changed.


BE CPU/memory: after the upgrade, CPU usage rose by about 20%.


As of now it still has not come down.
We plan to roll back tomorrow.

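So the main problem is that scan bytes have stayed high since the upgrade, is that right? Your be.WARNING log shows many queries failing because a memory limit was reached. Are you running many large or complex queries?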

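Yes, scan bytes have stayed high since the upgrade.
There were no large queries at that time, only some write jobs. The write jobs run continuously in real time.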

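Do your primary key tables use the persistent index feature? Run SHOW PROC '/statistic'; to check whether the cluster has any unhealthy replicas. Is the load frequency high? Also check in monitoring whether the compaction score is high.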

We upgraded from 2.1.12. The primary key table parameter enable_persistent_index was never configured, so it is at its default, false.

The load frequency is not high:
[image]

There are no unhealthy replicas either:

Could you share the disk I/O monitoring? Have you compared disk I/O before and after the upgrade?

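The disks are NVMe SSDs; there is essentially no I/O.
[image]
As of now it is still very high, and the large data warehouse jobs in the cluster can no longer run normally.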

We plan to roll back to 2.2.10. Does that version have a similar issue?
I upgraded from 2.1.12 to 2.3.7. Can I roll back directly to 2.2.10?

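Looking at the timing, the problem started after the FE was upgraded and restarted. The BEs were upgraded first, and the problem did not seem to appear after the BE-only upgrade.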

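Please check whether pipeline is currently enabled on the cluster. It is enabled by default in 2.3. If it is on, turn it off first and then see how things look.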

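After running:
set global enable_pipeline_engine=false;
admin set frontend config ("enable_statistic_collect"="false");
things appear to have recovered, although scan bytes are still higher than before the upgrade. Stability also seems worse than on 2.1.12: large queries that completed on the old version now fail on the new version with a memory limit exceeded error.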

Is the SQL of the large query the same? Does the memory limit error occur after pipeline was turned off?

Yes, it is the same job, and it still fails with pipeline turned off.

What is the reason for the failure? A timeout?

Memory limit exceeded. Large SQL queries that do succeed are also noticeably slower.

ERROR 1064 (HY000): Memory of process exceed limit. try consume:11862016 Used: 74235454456, Limit: 115274475785. Mem usage has exceed the limit of BE
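
The byte counts in this error are easier to compare in GiB. The following is a small sketch that extracts the three numbers from the message and converts them. The regular expression is an assumption based on this one message, and `parse_mem_error` is a hypothetical helper name. For the message above, the reported Used value works out to roughly 64% of the reported Limit.

```python
import re

# Field layout inferred from the single error message quoted above (an assumption).
MEM_RE = re.compile(r"try consume:(\d+) Used: (\d+), Limit: (\d+)")

GIB = 1024 ** 3
MIB = 1024 ** 2


def parse_mem_error(msg):
    """Extract the byte counts from the BE memory error and convert units."""
    match = MEM_RE.search(msg)
    if match is None:
        return None
    consume, used, limit = (int(g) for g in match.groups())
    return {
        "try_consume_mib": consume / MIB,
        "used_gib": used / GIB,
        "limit_gib": limit / GIB,
        "used_fraction": used / limit,
    }


msg = (
    "ERROR 1064 (HY000): Memory of process exceed limit. "
    "try consume:11862016 Used: 74235454456, Limit: 115274475785. "
    "Mem usage has exceed the limit of BE"
)
info = parse_mem_error(msg)
print(f"used {info['used_gib']:.1f} GiB of {info['limit_gib']:.1f} GiB "
      f"({info['used_fraction']:.0%}), requested {info['try_consume_mib']:.1f} MiB more")
```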

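For example, the run time of the job below went from 10s to 40s+, and it sometimes fails with the memory limit error (upgraded to 2.3.7 on the afternoon of the 11th, pipeline turned off on the morning of the 12th).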

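Could you run the same SQL once with pipeline enabled and once with it disabled, and share both profiles? The SQL that went from 10s to 40s+ would work too. We need the profiles to see where the difference is.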

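The FE queries page seems to list only successful SQL; I cannot find the failed SQL there. It is also not convenient to enable pipeline right now. This is a production environment, and enabling pipeline would probably bring back the high scan bytes and affect usage.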

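I also noticed that when a large query runs now, only one or two BEs show very high memory usage, which keeps climbing until it hits the limit and the query is killed.
Previously, when this kind of query ran, memory rose in step on all 12 BEs in the cluster. Now the memory usage is concentrated on a few nodes, so the execution strategy does not look right.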

The current plan is to downgrade to 2.2.10 and observe.

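Is it the same SQL before and after the cluster upgrade? Does this also happen when querying historical data (tables that have had no data loaded since the upgrade)? Did you observe this behavior with pipeline turned on?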

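If you have not rolled the cluster back yet, could you run the following and share screenshots?
show variables like 'parallel_fragment_exec_instance_num';
show variables like 'pipeline_dop';
Also the number of cores on the BE machines: show backends \G; lists the CPU cores.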

We have now downgraded to 2.2.10 and turned off pipeline. Large queries all run successfully.
These are the parameters on 2.2.10:

All 12 BE nodes have the same configuration: