StarRocks BE memory keeps growing

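【Details】After the BE nodes start, their memory usage grows steadily by 300-400 MB per day, even though no new jobs or large queries were introduced during this period.
【Background】The cluster runs 18 Kafka routine load tasks and a few Flink stream load tasks; there are no large queries.
【Business impact】Available memory keeps shrinking. Once memory is exhausted, new allocations fail and queries and other operations report errors.
【StarRocks version】2.2.10
【Cluster size】3 FE (3 followers) + 4 BE (FE and BE deployed on separate machines)
【Machine info】FE: 4C/16G, BE: 8C/32G
【Attachments】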

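  • BE node memory growth trend (one node shown; the other two nodes look the same):

  • BE node CPU usage screenshot (one node shown; the other two nodes look the same):

  • FE node memory growth trend (one node shown):

  • FE node CPU usage screenshot (one node shown, not the master):

  • BE configuration items: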
    routine_load_thread_pool_size = 15
    load_process_max_memory_limit_percent = 50
    tc_use_memory_min = 0
    tc_free_memory_rate = 0

  • BE memory metrics screenshots:
    curl -XGET -s http://localhost:8040/metrics | grep -E "^starrocks_be_.*_mem_bytes|^starrocks_be_tcmalloc_bytes_in_use"


    The starrocks_be_process_mem_bytes metric keeps growing, by about 300 MB per day on average.
    curl http://localhost:8040/memz

    curl http://localhost:8040/mem_tracker

    curl -XGET -s http://localhost:8040/metrics | grep "score"

  • Additional notes:
    Query load on the cluster is light: 18 Kafka routine load jobs, a few Flink stream load jobs, and some query jobs.

  • be.WARN log (excerpt only):
    W0108 18:36:47.622715 7767 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=9d4623344d759d14-f31afd154dbcdc90, job_id=-1, txn_id: 8198292, label=41f7cc1c-22e0-4497-b9fb-889863b66254, db=ods
    W0108 19:09:21.710310 7758 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=744225f0ab35d9be-49632d9f5304f5ab, job_id=-1, txn_id: 8201300, label=b7442045-d787-42fe-9521-f5596080a173, db=app
    W0108 19:09:21.750111 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=a94388a11fb9c9d5-aefbb4d9af520ab3, job_id=-1, txn_id: 8201306, label=e7fae10e-a876-4e44-9aeb-07d706ca6318, db=ods
    W0108 19:36:04.610087 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=814b445a182fad85-7d41254d0b1c9ba1, job_id=-1, txn_id: 8203732, label=42902edb-d080-4415-addc-b4ee4a79bc7a, db=app
    W0108 19:36:04.633272 7767 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=43425fbf977cf80a-7b7a0d74734d2eb0, job_id=-1, txn_id: 8203731, label=14c7edd2-c0b2-405c-a80c-f8eb55ed4315, db=ods
    W0108 19:36:04.674805 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=fe40980f8166f12c-eaafe2f41f02068a, job_id=-1, txn_id: 8203730, label=aa027062-d890-4cf2-b820-72b03dfb94b5, db=ods
    W0108 19:36:05.147940 7622 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=4602c5ff54624394-901fe72496f1e9d0, job_id=7220972, txn_id: 8203726, label=ods_trade_order_kafka_routine_job_20221227-7220972-4602c5ff-5462-4394-901f-e72496f1e9d0-8203726, db=default_cluster:ods
    W0108 19:38:02.658777 7766 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=584bd4f52eb4b0fb-9006afa052b15585, job_id=-1, txn_id: 8203917, label=41def4c9-6864-4d70-bc46-377a27d302e3, db=ods
    W0108 20:06:13.216154 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=c34d17a88520f194-2a428d474dfb5a84, job_id=-1, txn_id: 8206510, label=bceb9066-6515-4bfe-bfe9-bb0ff728b26d, db=ods
    W0108 20:07:49.900297 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=9d4deda005468354-753507925b2b5c89, job_id=-1, txn_id: 8206652, label=1f8db7c7-9159-429f-8070-9cadf0bbf068, db=ods
    W0108 20:17:44.131151 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=6e4fdceecdba59bc-b5840b21613529a9, job_id=-1, txn_id: 8207556, label=270d2461-414c-43b1-9f1d-f1d977dfccfa, db=ods

W0108 05:28:28.110669 7621 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=bf05ffaa7a674f50-8d34509de9553374, job_id=6504765, txn_id: 8125570, label=ods_deribit_option_ticker_kafka_routine_job_20221207-6504765-bf05ffaa-7a67-4f50-8d34-509de9553374-8125570, db=default_cluster:ods
W0108 05:28:28.110858 7608 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=61c79fff363e4c40-affe868a4748a03a, job_id=6186703, txn_id: 8125569, label=ods_market_transaction_kafka_routine_job_20221130-6186703-61c79fff-363e-4c40-affe-868a4748a03a-8125569, db=default_cluster:ods
W0108 05:29:44.493728 7630 beta_rowset.cpp:101] Fail to delete /opt/module/starrocks/storage/data/189/7640065/1387475083/0200000077f137772b40686bff82ff3912cbdd36233fc091_0.dat: No such file or directory [2]
W0108 05:29:44.494369 7630 beta_rowset.cpp:124] Fail to remove files in rowset id=/opt/module/starrocks/storage/data/189/7640065/1387475083/0200000077f137772b40686bff82ff3912cbdd36233fc091

W0108 04:13:01.424372 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639144
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639144
W0108 04:13:28.504305 7672 engine_checksum_task.cpp:58] Not found tablet: 7639132
W0108 04:13:28.504335 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639132
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639132
W0108 04:30:57.615008 7672 engine_checksum_task.cpp:58] Not found tablet: 7639374
W0108 04:30:57.616307 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639374
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639374
W0108 04:31:11.504632 7672 engine_checksum_task.cpp:58] Not found tablet: 7639356
W0108 04:31:11.504653 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639356
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639356
W0108 04:38:38.328316 7634 version_graph.cpp:421] fail to find path in version_graph. spec_version: 0-5783
W0108 04:38:38.333878 7634 tablet.cpp:469] version not found. tablet_id: 7595141, version: 5783
W0108 04:38:38.334857 7634 tablet.cpp:814] 7595141.333761361.2b497ac47b0b1ab4-5d388460d80543ba has 1 missed version:[5782-5782],
W0108 04:46:43.989122 7672 engine_checksum_task.cpp:58] Not found tablet: 7639442
W0108 04:46:43.990126 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639442
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639442
W0108 04:46:43.990509 7672 engine_checksum_task.cpp:58] Not found tablet: 7639448
W0108 04:46:43.990522 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639448
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639448
W0108 04:49:13.398389 7672 engine_checksum_task.cpp:58] Not found tablet: 7639430
W0108 04:49:13.398416 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639430
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639430
W0108 04:49:19.204452 7672 engine_checksum_task.cpp:58] Not found tablet: 7639436
W0108 04:49:19.204478 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639436
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639436

/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
W0108 03:13:51.642086 7672 task_worker_pool.cpp:1309] check consistency failed. status: Internal error: fail to init reader. tablet=7555487.1264657145.ab477461203d49f0-c11493eea4e6019dres=Unknown code(45): : version already been compacted. tablet_id: 7555487, version: 6460
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7555487
W0108 03:21:50.410542 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=86427445b690785e-4464b6e62cd25da4, job_id=-1, txn_id: 8114184, label=782900cb-fff9-473a-a1c8-1027f403360c, db=ods
W0108 03:25:09.958806 7672 version_graph.cpp:388] fail to find path in version_graph. spec_version: 0-1280
W0108 03:25:09.960064 7672 tablet.cpp:464] version already been compacted. tablet_id: 7595523, version: 1280
W0108 03:25:09.960119 7672 tablet_reader.cpp:64] fail to init reader. tablet=7595523.496654587.0140b05c95cd9878-a237b90794433b8fres=Unknown code(45): : version already been compacted. tablet_id: 7595523, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
W0108 03:25:09.960127 7672 engine_checksum_task.cpp:90] Failed to prepare tablet reader. tablet=7595523.496654587.0140b05c95cd9878-a237b90794433b8f, error:Internal error: fail to init reader. tablet=7595523.496654587.0140b05c95cd9878-a237b90794433b8fres=Unknown code(45): : version already been compacted. tablet_id: 7595523, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
W0108 03:25:09.960141 7672 task_worker_pool.cpp:1309] check consistency failed. status: Internal error: fail to init reader. tablet=7595523.496654587.0140b05c95cd9878-a237b90794433b8fres=Unknown code(45): : version already been compacted. tablet_id: 7595523, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7595523
W0108 03:25:29.617259 7672 version_graph.cpp:388] fail to find path in version_graph. spec_version: 0-1280
W0108 03:25:29.617291 7672 tablet.cpp:464] version already been compacted. tablet_id: 7595503, version: 1280
W0108 03:25:29.617301 7672 tablet_reader.cpp:64] fail to init reader. tablet=7595503.496654587.814574edd3ce2f3e-b92366d7d217fab0res=Unknown code(45): : version already been compacted. tablet_id: 7595503, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
W0108 03:25:29.617307 7672 engine_checksum_task.cpp:90] Failed to prepare tablet reader. tablet=7595503.496654587.814574edd3ce2f3e-b92366d7d217fab0, error:Internal error: fail to init reader. tablet=7595503.496654587.814574edd3ce2f3e-b92366d7d217fab0res=Unknown code(45): : version already been compacted. tablet_id: 7595503, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)
W0108 03:25:29.617332 7672 task_worker_pool.cpp:1309] check consistency failed. status: Internal error: fail to init reader. tablet=7595503.496654587.814574edd3ce2f3e-b92366d7d217fab0res=Unknown code(45): : version already been compacted. tablet_id: 7595503, version: 1280
/root/starrocks/be/src/storage/tablet.cpp:509 capture_consistent_versions(spec_version, &version_path)

W0107 22:28:58.308957 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=bf4e01b361cd42e3-b49ae31568f81986, job_id=-1, txn_id: 8087619, label=3a99f0a5-2efc-42ec-8864-ce9175b4ed68, db=ods
W0107 22:28:58.339509 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=d24a3a1eba339aed-39ee90a4a939ebad, job_id=-1, txn_id: 8087622, label=cfbb17f8-1510-4698-adfc-ee2167ff4cd4, db=ods
W0107 22:29:46.801215 7767 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=104f88a83b0ed6a8-e56a35fe30dc2394, job_id=-1, txn_id: 8087692, label=9b62ec9e-c612-4fa6-ab0d-d16653bc74f4, db=app
W0107 22:29:46.868320 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=e04b84f1eb3b898d-f711bca8b191e6a9, job_id=-1, txn_id: 8087694, label=a651a761-db7e-4e63-b3a9-5229c446b36f, db=ods
W0107 22:39:25.869778 7767 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=864362face0a397e-377e75f8ebb55998, job_id=-1, txn_id: 8088592, label=04cac543-1744-423a-93a0-3d933ebe1ba5, db=ods
W0107 22:39:25.902827 7765 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=6a4e8a8bc71923d8-921855ac194f358d, job_id=-1, txn_id: 8088589, label=377cf9d0-16f3-4e3d-8dea-b5470e1b84ef, db=ods
W0107 22:46:58.010635 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=fb457d218f139fe5-ab5acf015801718a, job_id=-1, txn_id: 8089279, label=5eefc81d-d8c0-42f9-818c-f8563cfee308, db=ods
W0107 22:54:58.755590 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=be42087ed22ec22a-412580207a8ec79f, job_id=-1, txn_id: 8089998, label=6f87dd36-3295-42ea-9221-aec90941d34e, db=ods
W0107 22:56:44.228694 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=8140e1eaa2d2346c-b369a84c84ea9397, job_id=-1, txn_id: 8090159, label=dc6b253c-4385-46cd-9c3e-b63bc26834a9, db=ods
W0107 22:57:21.186062 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=6543618c6fd0efba-3d8f7623a51c2bb2, job_id=-1, txn_id: 8090211, label=f92b8c01-e86c-4fa4-8914-b2fb7df854d4, db=ods
W0107 22:57:21.107124 7768 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=124668daa6eac5a3-80a34f70d882d6bd, job_id=-1, txn_id: 8090186, label=3924d085-c948-4e1d-b32d-d614f26dda7a, db=ods
W0107 23:19:10.389217 7769 stream_load_executor.cpp:202] commit transaction failed, errmsg=Publish timeout. The data will be visible after a whileid=5244b6dd85cccf86-50d2d91165bedaa2, job_id=-1, txn_id: 8092212, label=8464bc5c-786b-4e65-8c52-b6cc5db02ffa, db=app

W0108 03:00:07.943667 7565 fragment_mgr.cpp:195] Fail to open fragment 7e24d725-8ebd-11ed-91b1-06a3fdf487d3: Cancelled: Cancelled because of runtime state is cancelled
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:329 _plan->get_next(_runtime_state, &_chunk, &_done)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:217 _get_next_internal_vectorized(&chunk)
W0108 03:02:20.731957 7576 fragment_mgr.cpp:195] Fail to open fragment cccb44f2-8ebd-11ed-91b1-06a3fdf487d3: Cancelled: Cancelled because of runtime state is cancelled
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:329 _plan->get_next(_runtime_state, &_chunk, &_done)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:217 _get_next_internal_vectorized(&chunk)
W0108 03:03:23.443461 7533 fragment_mgr.cpp:195] Fail to open fragment f0cc30a1-8ebd-11ed-91b1-06a3fdf487d3: Cancelled: Cancelled because of runtime state is cancelled
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:329 _plan->get_next(_runtime_state, &_chunk, &_done)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:217 _get_next_internal_vectorized(&chunk)
W0108 03:04:25.784721 7565 fragment_mgr.cpp:195] Fail to open fragment 18f5de02-8ebe-11ed-91b1-06a3fdf487d3: Cancelled: Cancelled because of runtime state is cancelled
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:329 _plan->get_next(_runtime_state, &_chunk, &_done)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:217 _get_next_internal_vectorized(&chunk)

  • be.out:
    Normal; no unusual error messages.

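- Special notes:
**The cluster was upgraded once on 2022-12-15, from 2.2.5 to 2.2.10. Following the official instructions, only the lib directory was replaced.**
The test environment runs the same version but does not show the continuous BE memory growth seen in production; there starrocks_be_process_mem_bytes stays at around 3 GB. The only difference is that the test environment has very few jobs and very little data.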
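
To see which warning types dominate the be.WARN excerpt above, the entries can be grouped by the source location that emitted them. The following is a minimal sketch, not part of the original report: it assumes the standard glog line prefix (level letter, MMDD, time, thread id, `file:line]`), and the helper name `summarize_warnings` is hypothetical. The sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# glog prefix: level letter, MMDD, HH:MM:SS.micros, thread id, "file:line] message".
GLOG_RE = re.compile(
    r"^\s*([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([\w./-]+):(\d+)\] (.*)$"
)


def summarize_warnings(lines):
    """Count log entries per (source file, line number).

    Lines that do not start with a glog prefix (for example the
    "/root/starrocks/be/src/..." call-site lines) are treated as
    continuations of the previous entry and are not counted.
    """
    counts = Counter()
    for line in lines:
        match = GLOG_RE.match(line)
        if match:
            counts[(match.group(5), int(match.group(6)))] += 1
    return counts


# Sample lines taken verbatim from the be.WARN excerpt in this post.
sample = [
    "W0108 04:13:28.504305 7672 engine_checksum_task.cpp:58] Not found tablet: 7639132",
    "W0108 04:13:28.504335 7672 task_worker_pool.cpp:1309] check consistency failed. status: Not found: Not found tablet: 7639132",
    "/root/starrocks/be/src/storage/storage_engine.cpp:1051 task->execute(), signature: 7639132",
    "W0108 04:30:57.615008 7672 engine_checksum_task.cpp:58] Not found tablet: 7639374",
    "W0108 04:38:38.328316 7634 version_graph.cpp:421] fail to find path in version_graph. spec_version: 0-5783",
]

counts = summarize_warnings(sample)
for (source, lineno), n in counts.most_common():
    print(f"{n:5d}  {source}:{lineno}")
```

Run against the full be.WARN file (for example `summarize_warnings(open(path))`, where `path` is wherever the BE writes its logs), this shows at a glance whether the Publish timeout warnings or the consistency-check warnings make up most of the volume. It does not by itself indicate which of them, if any, is related to the memory growth.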

Have you compared the BE metrics against the mem_tracker output several times? Have you identified where exactly they differ?

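Yes. The two metrics starrocks_be_process_mem_bytes and starrocks_be_tcmalloc_bytes_in_use keep growing. There is another issue: starrocks_be_tablet_meta_mem_bytes is always negative. I suspect that metadata memory keeps growing without being released, which would be a memory leak.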
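
One way to make this comparison repeatable is to save the output of the /metrics endpoint at two points in time and diff the memory gauges. The sketch below is an illustration under stated assumptions, not something from the original thread: it assumes each sample line has the form `name value` with no trailing timestamp, the function names are hypothetical, and the numbers in the two snapshots are made up purely to exercise the code.

```python
TCMALLOC_IN_USE = "starrocks_be_tcmalloc_bytes_in_use"


def parse_metrics(text):
    """Parse 'name value' lines from a metrics dump; skip comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        if not name:
            continue
        try:
            values[name] = float(raw)
        except ValueError:
            continue
    return values


def diff_mem_metrics(before, after):
    """Return {metric: delta in bytes} for BE memory gauges, largest growth first."""
    def wanted(name):
        # Roughly mirrors the grep pattern used earlier in this post.
        return name.startswith("starrocks_be_") and (
            name.endswith("_mem_bytes") or name == TCMALLOC_IN_USE
        )

    deltas = {n: after[n] - before[n] for n in after if wanted(n) and n in before}
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


def negative_metrics(snapshot):
    """Names of metrics whose value is below zero (suspicious for a byte counter)."""
    return sorted(n for n, v in snapshot.items() if v < 0)


# Made-up snapshots for illustration only; real input is the saved curl output.
before = parse_metrics("""\
# first snapshot
starrocks_be_process_mem_bytes 3000000000
starrocks_be_tcmalloc_bytes_in_use 2800000000
starrocks_be_tablet_meta_mem_bytes -1000
""")
after = parse_metrics("""\
# second snapshot, taken later
starrocks_be_process_mem_bytes 3300000000
starrocks_be_tcmalloc_bytes_in_use 3050000000
starrocks_be_tablet_meta_mem_bytes -2000
""")

deltas = diff_mem_metrics(before, after)
for name, delta in deltas.items():
    print(f"{delta / 1e6:+10.1f} MB  {name}")
print("negative:", negative_metrics(after))
```

If the process-level gauges grow while every tracked `*_mem_bytes` gauge stays flat, the growth is happening in memory the trackers do not account for. A negative byte gauge more likely points to an accounting error in that tracker than to a real negative allocation, so it should be read with caution.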