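【Background】Upgraded directly from 2.1.12 to 2.3.7
【Business impact】Large jobs fail to complete
【StarRocks version】2.3.7 (upgraded directly from 2.1.12)
【Cluster size】3 FE + 12 BE
【Machine info】16 cores, 128 GB RAM, 10 Gb NIC, 16 TB SSD
【Contact】WeChat ID: zzDuke1688
【Business context】Most tables in the cluster are large (>1 TB) unique key / primary key tables; most data is ingested via stream load
【Attachments】
As shown in the screenshot: since the upgrade and node restart, BE load has stayed high. We do not know the cause, or whether this is normal behavior:
be.WARNING contains a large number of log entries like the following: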
W0111 07:32:31.503943 2455548 fragment_context.cpp:19] [Driver] Canceled, query_id=1ba976c0-9182-11ed-afbc-0242f4bd4a45, instance_id=1ba976c0-9182-11ed-afbc-0242f4bd4a51, reason=Cancelled: LimitReach
W0111 07:32:33.479714 2455553 fragment_context.cpp:19] [Driver] Canceled, query_id=1abfd984-9182-11ed-a7d5-02369d384f1f, instance_id=1abfd984-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:32:48.433082 2455526 fragment_context.cpp:19] [Driver] Canceled, query_id=22fa0546-9182-11ed-a7d5-02369d384f1f, instance_id=22fa0546-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:02.520313 2455538 fragment_context.cpp:19] [Driver] Canceled, query_id=2a98db68-9182-11ed-a7d5-02369d384f1f, instance_id=2a98db68-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:15.870564 2455547 fragment_context.cpp:19] [Driver] Canceled, query_id=334f150b-9182-11ed-a7d5-02369d384f1f, instance_id=334f150b-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:33:43.037076 2455534 fragment_context.cpp:19] [Driver] Canceled, query_id=4655892e-9182-11ed-afbc-0242f4bd4a45, instance_id=4655892e-9182-11ed-afbc-0242f4bd4a51, reason=Cancelled: LimitReach
W0111 07:33:43.257545 2455532 fragment_context.cpp:19] [Driver] Canceled, query_id=437ffaec-9182-11ed-a7d5-02369d384f1f, instance_id=437ffaec-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:10.653262 2455544 fragment_context.cpp:19] [Driver] Canceled, query_id=547c2013-9182-11ed-a7d5-02369d384f1f, instance_id=547c2013-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:24.168982 2455525 fragment_context.cpp:19] [Driver] Canceled, query_id=5b896494-9182-11ed-a7d5-02369d384f1f, instance_id=5b896494-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:37.188655 2455532 fragment_context.cpp:19] [Driver] Canceled, query_id=63c4eeeb-9182-11ed-a7d5-02369d384f1f, instance_id=63c4eeeb-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:34:51.820116 2455527 fragment_context.cpp:19] [Driver] Canceled, query_id=6c5689dc-9182-11ed-a7d5-02369d384f1f, instance_id=6c5689dc-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:06.186728 2455533 fragment_context.cpp:19] [Driver] Canceled, query_id=74ca1523-9182-11ed-a7d5-02369d384f1f, instance_id=74ca1523-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:19.633118 2455526 fragment_context.cpp:19] [Driver] Canceled, query_id=7d083884-9182-11ed-a7d5-02369d384f1f, instance_id=7d083884-9182-11ed-a7d5-02369d384f2b, reason=Cancelled: LimitReach
W0111 07:35:34.016834 2455534 fragment_context.cpp:19] [Driver] Canceled, query_id=887d6a75-9182-11ed-aaff-024216bb15cf, instance_id=887d6a75-9182-11ed-aaff-024216bb15db, reason=Cancelled: LimitReach
W0111 07:38:46.861287 2455542 fragment_context.cpp:19] [Driver] Canceled, query_id=fac20abe-9182-11ed-aaff-024216bb15cf, instance_id=fac20abe-9182-11ed-aaff-024216bb15dc, reason=Cancelled: UserCancel
W0111 07:38:46.861518 2455550 fragment_context.cpp:19] [Driver] Canceled, query_id=fac20abe-9182-11ed-aaff-024216bb15cf, instance_id=fac20abe-9182-11ed-aaff-024216bb15db, reason=Cancelled: UserCancel
W0111 07:43:59.635890 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97243182 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97243182 #version:195 [81240 81422@194 81422] #pending:0 request_version:81423,
W0111 07:44:00.636166 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97080722 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97080722 #version:111 [181794 181892@110 181892] #pending:0 request_version:181893,
W0111 07:44:01.636880 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97314864 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97314864 #version:195 [91941 92123@194 92123] #pending:0 request_version:92124,
W0111 07:44:02.636570 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97243166 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97243166 #version:195 [81240 81422@194 81422] #pending:0 request_version:81423,
W0111 07:44:02.636588 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97080658 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97080658 #version:111 [181794 181892@110 181892] #pending:0 request_version:181893,
W0111 07:44:03.636886 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97318868 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97318868 #version:195 [91942 92124@194 92124] #pending:0 request_version:92125,
W0111 07:46:40.203774 2455547 fragment_context.cpp:19] [Driver] Canceled, query_id=1591a3cb-9184-11ed-aaff-024216bb15cf, instance_id=1591a3cb-9184-11ed-aaff-024216bb15db, reason=Cancelled: LimitReach
W0111 07:47:47.675930 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114237 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114237 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344,
W0111 07:47:48.676335 2461461 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114221 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114221 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344,
W0111 07:47:49.676908 2461462 agent_server.cpp:308] fail to make_snapshot. tablet_id:97114205 msg:Not found: get_rowsets_for_snapshot: no version to clone tablet:97114205 #version:197 [55160 55343@196 55343] #pending:0 request_version:55344
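To see how the warnings above break down, a small script along these lines could tally them. This is only a rough sketch: the category names (`cancel:...`, `make_snapshot_failed`, `other:...`) are made up for illustration, and the regex assumes the glog-style prefix seen in the lines above. Lines that do not start with that prefix are treated as wrapped continuations of the previous record.

```python
import re
from collections import Counter

# glog-style prefix as seen in the pasted log:
# severity letter + MMDD, time, thread id, source file:line, then "] "
GLOG_PREFIX = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ (?P<src>[\w.]+):\d+\] "
)
CANCEL_REASON = re.compile(r"reason=Cancelled: (\w+)")


def iter_records(lines):
    """Yield complete log records, re-joining lines that were wrapped."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if GLOG_PREFIX.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # No glog prefix: assume it continues the previous record.
            current += line.strip()
    if current is not None:
        yield current


def summarize_warnings(lines):
    """Count records per coarse category (the categories are hypothetical)."""
    counts = Counter()
    for record in iter_records(lines):
        prefix = GLOG_PREFIX.match(record)
        message = record[prefix.end():]
        reason = CANCEL_REASON.search(message)
        if reason:
            counts["cancel:" + reason.group(1)] += 1
        elif "fail to make_snapshot" in message:
            counts["make_snapshot_failed"] += 1
        else:
            counts["other:" + prefix.group("src")] += 1
    return counts
```

It could be run as `summarize_warnings(open(path))`, where `path` points at the BE warning log on one node; the exact file location depends on the deployment.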
be.INFO:
I0111 08:24:01.114723 2455620 txn_manager.cpp:204] Commit txn successfully. tablet: 101969445, txn_id: 232245965, rowsetid: 02000000006d7b659f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.116550 2455062 txn_manager.cpp:204] Commit txn successfully. tablet: 101969397, txn_id: 232245965, rowsetid: 02000000006d7b629f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.116703 2455067 txn_manager.cpp:204] Commit txn successfully. tablet: 101969429, txn_id: 232245965, rowsetid: 02000000006d7b649f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.135730 2455612 txn_manager.cpp:204] Commit txn successfully. tablet: 101969381, txn_id: 232245965, rowsetid: 02000000006d7b619f4d26ff72ba53e910c3c3574458819b #segment:1 #delfile:0
I0111 08:24:01.202814 2462606 task_worker_pool.cpp:202] Submit task success. type=PUBLISH_VERSION, signature=232245965, task_count_in_queue=1
I0111 08:24:01.202824 2455464 task_worker_pool.cpp:869] get publish version task, signature:232245965 txn_id: 232245965 priority queue size: 1
I0111 08:24:01.202912 2455468 tablet_updates.cpp:479] commit rowset tablet:101969381 version:98 txn_id: 232245965 02000000006d7b619f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11366 size:2.95 MB #pending:0
I0111 08:24:01.202930 2455468 engine_publish_version_task.cpp:60] Publish txn success tablet:101969381 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b619f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.202987 2455476 tablet_updates.cpp:479] commit rowset tablet:101969397 version:98 txn_id: 232245965 02000000006d7b629f4d26ff72ba53e910c3c3574458819b rowset:104 #seg:1 #delfile:0 #row:11085 size:2.78 MB #pending:0
I0111 08:24:01.203002 2455476 engine_publish_version_task.cpp:60] Publish txn success tablet:101969397 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b629f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203037 2455480 tablet_updates.cpp:479] commit rowset tablet:101969413 version:98 txn_id: 232245965 02000000006d7b639f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11185 size:2.83 MB #pending:0
I0111 08:24:01.203085 2455480 engine_publish_version_task.cpp:60] Publish txn success tablet:101969413 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b639f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203119 2455474 tablet_updates.cpp:479] commit rowset tablet:101969429 version:98 txn_id: 232245965 02000000006d7b649f4d26ff72ba53e910c3c3574458819b rowset:104 #seg:1 #delfile:0 #row:11302 size:2.91 MB #pending:0
I0111 08:24:01.203189 2455474 engine_publish_version_task.cpp:60] Publish txn success tablet:101969429 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b649f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203225 2455472 tablet_updates.cpp:479] commit rowset tablet:101969445 version:98 txn_id: 232245965 02000000006d7b659f4d26ff72ba53e910c3c3574458819b rowset:105 #seg:1 #delfile:0 #row:11536 size:2.96 MB #pending:0
I0111 08:24:01.203290 2455472 engine_publish_version_task.cpp:60] Publish txn success tablet:101969445 version:98 partition:101969376 txn_id: 232245965 rowset:02000000006d7b659f4d26ff72ba53e910c3c3574458819b
I0111 08:24:01.203301 2455464 task_worker_pool.cpp:797] Publish version on partition. partition: 101969376, txn_id: 232245965, version: 98
I0111 08:24:01.203310 2455464 task_worker_pool.cpp:898] publish_version success. signature:232245965 txn_id: 232245965 related tablet num: 5 time: 0ms
I0111 08:24:01.208277 2488900 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969429 version:98 txn_id: 232245965 total del/row:0/801684 0% rowset:104 #seg:1 #op(upsert:11302 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208402 2488901 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969445 version:98 txn_id: 232245965 total del/row:0/798896 0% rowset:105 #seg:1 #op(upsert:11536 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208578 2488894 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969381 version:98 txn_id: 232245965 total del/row:0/801906 0% rowset:105 #seg:1 #op(upsert:11366 del:0) #del:0+0=0 #dv:1 duration:5ms(0/0/5/0/0)
I0111 08:24:01.208597 2488899 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969413 version:98 txn_id: 232245965 total del/row:0/801266 0% rowset:105 #seg:1 #op(upsert:11185 del:0) #del:0+0=0 #dv:1 duration:6ms(0/0/5/0/1)
I0111 08:24:01.208832 2488895 tablet_updates.cpp:988] apply_rowset_commit finish. tablet:101969397 version:98 txn_id: 232245965 total del/row:0/800772 0% rowset:104 #seg:1 #op(upsert:11085 del:0) #del:0+0=0 #dv:1 duration:6ms(0/0/6/0/0)
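To check whether primary key applies are slow anywhere else in be.INFO, the `apply_rowset_commit finish` lines could be filtered by their total duration. The sketch below assumes the line format shown above; the function name `slow_applies` and the 100 ms default threshold are arbitrary choices for illustration, not StarRocks settings.

```python
import re

# Pulls tablet, version, upsert/delete counts and total duration (ms)
# out of an "apply_rowset_commit finish" line as formatted above.
APPLY_RE = re.compile(
    r"apply_rowset_commit finish\. tablet:(\d+) version:(\d+)"
    r".*?#op\(upsert:(\d+) del:(\d+)\)"
    r".*?duration:(\d+)ms"
)


def slow_applies(lines, threshold_ms=100):
    """Return (tablet, version, upserts, deletes, duration_ms) tuples for
    applies whose total duration is at least threshold_ms."""
    slow = []
    for line in lines:
        match = APPLY_RE.search(line)
        if not match:
            continue  # not an apply_rowset_commit line
        tablet, version, upserts, deletes, duration = map(int, match.groups())
        if duration >= threshold_ms:
            slow.append((tablet, version, upserts, deletes, duration))
    return slow
```

In the excerpt above every apply finishes in 5 to 6 ms, so nothing would be reported at the default threshold.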
Oddly, monitoring shows BE scan bytes are very high, while BE scan rows have not changed much.
BE CPU/memory: after the upgrade, CPU usage rose by roughly 20%.