[CN memory] query_pool memory on CN nodes is not released

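To help us locate your problem faster, please provide the following information. Thank you.
[Details] Detailed description of the problem
[Background]
1. Stream Load writes -> disk fills up -> drop the table -> recreate the table
2. An exporter continuously scrapes metrics (this requires querying tables under information_schema)
[Business impact]
[Shared-data mode] Yes
[StarRocks version] 3.3.2
[Cluster size] 3 FE, 6 CN
[Machine info] CPU virtual cores/memory/NIC, for example: 48C/64G/10GbE
[Contact] StarRocks 3.0 shared-data user group, redscarf
[Attachments]

  • fe.log/beINFO/relevant screenshots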
I20241125 09:22:07.099787 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=db44962c-94f4-4d43-f232-b56dfbe4da9b
I20241125 09:22:07.099801 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment db44962c-94f4-4d43-f232-b56dfbe4da9b
I20241125 09:22:07.099870 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=a6430d4d-78bb-23dc-6a67-34462a96bbb3
I20241125 09:22:07.099878 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment a6430d4d-78bb-23dc-6a67-34462a96bbb3
I20241125 09:22:07.099892 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=db431cf3-86cb-7076-bca3-38a439644fb9
I20241125 09:22:07.099899 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment db431cf3-86cb-7076-bca3-38a439644fb9
I20241125 09:22:07.099912 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=bb489cac-aada-38ce-428b-f256a43acf87
I20241125 09:22:07.099920 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment bb489cac-aada-38ce-428b-f256a43acf87
I20241125 09:22:07.099933 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=0c4a80ca-e36a-1c39-fe9e-25d80f79dbb4
I20241125 09:22:07.099940 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 0c4a80ca-e36a-1c39-fe9e-25d80f79dbb4
I20241125 09:22:07.099954 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=f744af0e-f89e-f648-a3bf-722e0512bea8
I20241125 09:22:07.099966 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment f744af0e-f89e-f648-a3bf-722e0512bea8
I20241125 09:22:07.099975 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=ec45fb1d-4c39-cc69-9e19-1afd7b9fc4a5
I20241125 09:22:07.099989 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment ec45fb1d-4c39-cc69-9e19-1afd7b9fc4a5
I20241125 09:22:07.099997 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=264ccb9b-1249-0f3c-5a20-265f24e8a3ba
I20241125 09:22:07.100009 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 264ccb9b-1249-0f3c-5a20-265f24e8a3ba
I20241125 09:22:07.100017 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=1d413bea-78fb-673e-ec05-0bc20c2b21a1
I20241125 09:22:07.100031 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 1d413bea-78fb-673e-ec05-0bc20c2b21a1
I20241125 09:22:07.100039 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=d04614ff-f6a3-4686-90a5-3c6a4d717fae
I20241125 09:22:07.100052 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment d04614ff-f6a3-4686-90a5-3c6a4d717fae
I20241125 09:22:07.100060 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=6e46f73b-cbb5-1840-48ae-c5baaffec2ab
I20241125 09:22:07.100074 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 6e46f73b-cbb5-1840-48ae-c5baaffec2ab
I20241125 09:22:07.100082 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=294429e4-38ff-5d96-f2dd-f156d0af6d99
I20241125 09:22:07.100089 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 294429e4-38ff-5d96-f2dd-f156d0af6d99
I20241125 09:22:07.100103 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=d542de6b-0253-9207-cb0d-b14012d454a3
I20241125 09:22:07.100110 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment d542de6b-0253-9207-cb0d-b14012d454a3
I20241125 09:22:07.100123 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=724066be-eee3-ad13-aef3-27f768ff5cb9
I20241125 09:22:07.100130 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 724066be-eee3-ad13-aef3-27f768ff5cb9
I20241125 09:22:07.100143 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=a945c58b-fbd5-703f-def4-712bca7e9787
I20241125 09:22:07.100166 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment a945c58b-fbd5-703f-def4-712bca7e9787
I20241125 09:22:07.100180 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=254c819d-de35-27a4-6584-053618239989
I20241125 09:22:07.100187 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 254c819d-de35-27a4-6584-053618239989
I20241125 09:22:07.100195 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=0747a6d5-8161-2007-e265-5f9548f22b98
I20241125 09:22:07.100207 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 0747a6d5-8161-2007-e265-5f9548f22b98
I20241125 09:22:07.100220 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=454037cc-91c4-4a0e-53d3-dc89b5621da9
I20241125 09:22:07.100228 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 454037cc-91c4-4a0e-53d3-dc89b5621da9
I20241125 09:22:07.100242 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=7a4ad907-95d1-9ab6-c5dc-e897b4580592
I20241125 09:22:07.100250 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 7a4ad907-95d1-9ab6-c5dc-e897b4580592
I20241125 09:22:07.100264 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=0741ff68-2c43-8e48-baee-f86b3fb5508a
I20241125 09:22:07.100271 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 0741ff68-2c43-8e48-baee-f86b3fb5508a
I20241125 09:22:07.100281 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=2c4bc199-cbc8-4865-437b-c38cf3142981
I20241125 09:22:07.100289 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 2c4bc199-cbc8-4865-437b-c38cf3142981
I20241125 09:22:07.100297 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=9042c214-17e2-68a5-9691-4f131820e3a8
I20241125 09:22:07.100307 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 9042c214-17e2-68a5-9691-4f131820e3a8
I20241125 09:22:07.100317 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=81494aeb-7103-fc26-8486-c598a27647aa
I20241125 09:22:07.100325 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 81494aeb-7103-fc26-8486-c598a27647aa
I20241125 09:22:07.100333 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=aa473540-7d56-c2f6-fce9-ecbbf73cb1a6
I20241125 09:22:07.100344 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment aa473540-7d56-c2f6-fce9-ecbbf73cb1a6
I20241125 09:22:07.100353 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=794160e6-6080-6a46-abff-0b192cdd97aa
I20241125 09:22:07.100363 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 794160e6-6080-6a46-abff-0b192cdd97aa
I20241125 09:22:07.100374 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=a748356c-53da-c0bd-e1ef-431522a10f82
I20241125 09:22:07.100384 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment a748356c-53da-c0bd-e1ef-431522a10f82
I20241125 09:22:07.100395 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=a448a704-7c78-0a2a-f912-059517240f89
I20241125 09:22:07.100405 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment a448a704-7c78-0a2a-f912-059517240f89
I20241125 09:22:07.100417 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=fb479d76-b647-1f6d-1b61-98f39ebaca9e
I20241125 09:22:07.100427 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment fb479d76-b647-1f6d-1b61-98f39ebaca9e
I20241125 09:22:07.100449 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=7f44ccaf-2c56-ddaa-9000-ebdc563b5587
I20241125 09:22:07.100457 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 7f44ccaf-2c56-ddaa-9000-ebdc563b5587
I20241125 09:22:07.100468 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=aa483d3a-68bb-abce-0b52-5d6e835e71bf
I20241125 09:22:07.100476 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment aa483d3a-68bb-abce-0b52-5d6e835e71bf
I20241125 09:22:07.100484 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=87420a98-b109-bf43-bda7-62b98ed815b2
I20241125 09:22:07.100491 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 87420a98-b109-bf43-bda7-62b98ed815b2
I20241125 09:22:07.100499 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=9845ccdf-7c8b-32e8-227b-071b9cb6708a
I20241125 09:22:07.100506 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 9845ccdf-7c8b-32e8-227b-071b9cb6708a
I20241125 09:22:07.100514 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=4b46c787-0e74-26ff-e232-5e6ad09d41b0
I20241125 09:22:07.100522 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 4b46c787-0e74-26ff-e232-5e6ad09d41b0
I20241125 09:22:07.100530 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=1e4f2494-ae6b-8599-e2c0-e519368a6295
I20241125 09:22:07.100537 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 1e4f2494-ae6b-8599-e2c0-e519368a6295
I20241125 09:22:07.100545 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=914af4d8-9264-2832-0c9e-def47b745cb1
I20241125 09:22:07.100552 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 914af4d8-9264-2832-0c9e-def47b745cb1
I20241125 09:22:07.100560 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=4248b3fd-4504-fe35-8a78-f944815724b1
I20241125 09:22:07.100567 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 4248b3fd-4504-fe35-8a78-f944815724b1
I20241125 09:22:07.100575 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=9642260d-4cdb-45a7-ab4f-9e493dabdd99
I20241125 09:22:07.100582 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 9642260d-4cdb-45a7-ab4f-9e493dabdd99
I20241125 09:22:07.100590 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=494639cb-4377-a50f-0265-45595f483e9d
I20241125 09:22:07.100597 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 494639cb-4377-a50f-0265-45595f483e9d
I20241125 09:22:07.100606 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=154ee038-2709-cee0-0056-f7a9faa60db5
I20241125 09:22:07.100614 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 154ee038-2709-cee0-0056-f7a9faa60db5
I20241125 09:22:07.100622 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=eb4b7ac4-d5d1-9682-ff6e-839db6718cbe
I20241125 09:22:07.100629 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment eb4b7ac4-d5d1-9682-ff6e-839db6718cbe
I20241125 09:22:07.100637 140311655126592 plan_fragment_executor.cpp:373] cancel(): fragment_instance_id=3f4bf745-ddcf-fa15-ce44-74073f95a38c
I20241125 09:22:07.100644 140311655126592 fragment_mgr.cpp:580] FragmentMgr cancel worker going to cancel timeout fragment 3f4bf745-ddcf-fa15-ce44-74073f95a38c

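The logs confirm that these fragments have timed out and need to be cleaned up, and that the cancel worker thread is processing them. However, the fragments are never actually cleaned up and keep occupying query_pool memory.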
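
To check whether the same fragments keep coming back across cancel passes, the log can be scanned for repeated IDs. The following is a minimal sketch, not part of StarRocks: the function names `count_cancel_lines` and `repeatedly_cancelled` are hypothetical, and it assumes that a fragment which is still registered gets one `cancel(): fragment_instance_id=` line per pass of the cancel worker, so an ID that appears more than once was probably not cleaned up.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Counts how many "cancel(): fragment_instance_id=<id>" lines exist per id.
// The marker string is copied from the log format shown in this thread.
std::map<std::string, int> count_cancel_lines(const std::string& log_text) {
    static const std::string kMarker = "cancel(): fragment_instance_id=";
    std::map<std::string, int> counts;
    std::istringstream in(log_text);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = line.find(kMarker);
        if (pos == std::string::npos) continue;  // not a cancel() line
        std::string id = line.substr(pos + kMarker.size());
        // Drop anything after the id, including a trailing '\r'.
        std::size_t end = id.find_first_of(" \t\r");
        if (end != std::string::npos) id.erase(end);
        ++counts[id];
    }
    return counts;
}

// Returns the ids that were logged by cancel() more than once.
std::vector<std::string> repeatedly_cancelled(const std::string& log_text) {
    std::vector<std::string> ids;
    for (const auto& entry : count_cancel_lines(log_text)) {
        if (entry.second > 1) ids.push_back(entry.first);
    }
    return ids;
}
```

Feeding it the CN log covering several timeout cycles gives a list of fragment IDs worth looking up in the pstack output.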

I have filed an issue on GitHub, see: [BUG] CN nodes can not release query_pool Fragment memory · Issue #53155 · StarRocks/starrocks (github.com)

This is probably unrelated to that. Could you capture a pstack?

cn0.txt (54.1 KB)

Have you disabled pipeline load?

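I haven't set any parameters. Which parameter controls this? Note that I can't guarantee this was triggered by Stream Load, because the cluster also serves queries.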

It looks like this is caused by JsonReader not implementing fast cancel.

# Is it this stack?
    0x7f9d1fdec115  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x91115)
    0x7f9d1fdeea41  pthread_cond_wait
         0xeabff30  std::condition_variable::wait(std::unique_lock<std::mutex>&)
         0x867e290  starrocks::StreamLoadPipe::read()
         0x77dda4a  starrocks::StreamLoadPipeReader::read()
         0x77da672  starrocks::JsonReader::_read_file_stream()
         0x77daf4f  starrocks::JsonReader::_read_and_parse_json()
         0x77dcf73  starrocks::JsonScanner::_open_next_reader()
         0x77dd1e4  starrocks::JsonScanner::get_next()
         0x7703ee9  starrocks::connector::FileDataSource::get_next(starrocks::RuntimeState*, std::shared_ptr<starrocks::Chunk>*)
         0x78c3e5a  starrocks::ConnectorScanNode::_scanner_thread(starrocks::ConnectorScanner*)
         0x88e01e4  starrocks::ThreadPool::dispatch_thread()
         0x88d95b9  starrocks::Thread::supervise_thread(void*)
    0x7f9d1fdefac3  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x94ac3)
    0x7f9d1fe80a04  clone

I looked at the code based on the stack.

# File: plan_fragment_executor.cpp
void PlanFragmentExecutor::cancel() {
    LOG(INFO) << "cancel(): fragment_instance_id=" << print_id(_runtime_state->fragment_instance_id());
    DCHECK(_prepared);
    {
        std::lock_guard<std::mutex> l(_status_lock);
        if (_runtime_state->is_cancelled()) {
            return;
        }
        _runtime_state->set_is_cancelled(true);
    }

    const TQueryOptions& query_options = _runtime_state->query_options();
    if (query_options.query_type == TQueryType::LOAD && (query_options.load_job_type == TLoadJobType::BROKER ||
                                                         query_options.load_job_type == TLoadJobType::INSERT_QUERY ||
                                                         query_options.load_job_type == TLoadJobType::INSERT_VALUES)) {
        starrocks::ExecEnv::GetInstance()->profile_report_worker()->unregister_non_pipeline_load(
                _runtime_state->fragment_instance_id());
    }
    if (_stream_load_contexts.size() > 0) {
        for (const auto& stream_load_context : _stream_load_contexts) {
            if (stream_load_context->body_sink) {
                Status st;
                // cancel is indeed executed here
                stream_load_context->body_sink->cancel(st);
            }
            if (_channel_stream_load) {
                _exec_env->stream_context_mgr()->remove_channel_context(stream_load_context);
            }
        }
        _stream_load_contexts.resize(0);
    }
    _runtime_state->exec_env()->stream_mgr()->cancel(_runtime_state->fragment_instance_id());
    (void)_runtime_state->exec_env()->result_mgr()->cancel(_runtime_state->fragment_instance_id());

    if (_is_runtime_filter_merge_node) {
        _runtime_state->exec_env()->runtime_filter_worker()->close_query(_query_id);
    }
}
# File: stream_load_pipe.h

class StreamLoadPipe : public MessageBodySink {
public:
    StreamLoadPipe(size_t max_buffered_bytes = 1024 * 1024, size_t min_chunk_size = 64 * 1024)
            : _max_buffered_bytes(max_buffered_bytes), _min_chunk_size(min_chunk_size) {}
    ~StreamLoadPipe() override = default;


private:
    Status _append(const ByteBufferPtr& buf);

    std::condition_variable _put_cond;
    std::condition_variable _get_cond;
    bool _non_blocking_read{false};

    bool _finished{false};
    // Is there a thread-safety issue here?
    bool _cancelled{false};

    ByteBufferPtr _write_buf;
    ByteBufferPtr _read_buf;
    Status _err_st = Status::OK();
};


# File: stream_load_pipe.cpp
StatusOr<ByteBufferPtr> StreamLoadPipe::read() {
    if (_non_blocking_read) {
        return no_block_read();
    }
    std::unique_lock<std::mutex> l(_lock);
    // The thread waits here indefinitely
    _get_cond.wait(l, [&]() { return _cancelled || _finished || !_buf_queue.empty(); });

    // cancelled
    if (_cancelled) {
        return Status::EndOfFile("all data has been read");
    }

    // finished
    if (_buf_queue.empty()) {
        DCHECK(_finished);
        return Status::EndOfFile("all data has been read");
    }
    auto buf = std::move(_buf_queue.front());
    _buf_queue.pop_front();
    _buffered_bytes -= buf->limit;
    _put_cond.notify_one();
    return buf;
}
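
On the thread-safety question: as a general rule for condition variables, a plain `bool` flag is fine as long as every read and write of it happens while holding the same mutex the waiter uses, and the writer notifies after setting it. The excerpt above does not show how `StreamLoadPipe::cancel()` is implemented, so the following is only a minimal sketch under that assumption. `MiniPipe` and `cancel_wakes_blocked_reader` are hypothetical names, not StarRocks code.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a blocking pipe; NOT the StarRocks StreamLoadPipe.
class MiniPipe {
public:
    // Blocks until data arrives, the writer finishes, or the pipe is cancelled.
    // Returns false on cancel or end of stream, true when *out was filled.
    bool read(std::string* out) {
        std::unique_lock<std::mutex> l(_lock);
        _get_cond.wait(l, [&] { return _cancelled || _finished || !_buf_queue.empty(); });
        if (_cancelled || _buf_queue.empty()) return false;
        *out = std::move(_buf_queue.front());
        _buf_queue.pop_front();
        return true;
    }

    void append(std::string buf) {
        {
            std::lock_guard<std::mutex> l(_lock);
            _buf_queue.push_back(std::move(buf));
        }
        _get_cond.notify_one();
    }

    void finish() {
        {
            std::lock_guard<std::mutex> l(_lock);
            _finished = true;
        }
        _get_cond.notify_all();
    }

    // The flag is written under the mutex the waiter uses, then all waiters are woken.
    void cancel() {
        {
            std::lock_guard<std::mutex> l(_lock);
            _cancelled = true;
        }
        _get_cond.notify_all();
    }

private:
    std::mutex _lock;
    std::condition_variable _get_cond;
    std::deque<std::string> _buf_queue;
    bool _finished = false;
    bool _cancelled = false;
};

// Returns true if a reader blocked in read() returns within one second of cancel().
bool cancel_wakes_blocked_reader() {
    MiniPipe pipe;
    auto result = std::async(std::launch::async, [&] {
        std::string ignored;
        return pipe.read(&ignored);
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // give the reader time to block
    pipe.cancel();
    bool woke = result.wait_for(std::chrono::seconds(1)) == std::future_status::ready;
    return woke && !result.get();  // read() must report "no data"
}
```

With this pattern a reader cannot stay parked in `wait()` once `cancel()` has run. If the real reader thread is still blocked after the cancel log line, one thing worth checking is whether the pipe it waits on is the same object that `body_sink->cancel(st)` was called on.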

Has this problem been resolved? I also see query_pool on my BE holding more than 40 GB without releasing it, but I'm not running in shared-data mode.

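Not fully resolved. Earlier we traced the problem to transaction stream load: a lock leak on the BE side prevented the memory from being reclaimed.
You can check this through the BE web UI.
URL: beip:beport/mem_tracker?type=query_pool&upper_level=3
See whether there are fragments that are not being released. If you have a large number of unreleased fragments, it is almost certainly the same problem as mine.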

What is taking up the memory in my case? It doesn't look like fragments.

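Then your case is different from mine. In yours, a large amount of memory is in use within the default_wg resource group. Check whether there are any large queries running.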

OK, thanks. I'll take a look.

My real-time Flink job inserts data. Does that use this memory? It loads JSON, and the volume is large, around 100,000 rows per minute.

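Which version are you on? In 3.3 (I haven't looked closely at other versions), Stream Load uses query_pool memory, but each Stream Load has its own fragment, so the usage should show up as high fragment memory. Your screenshot shows most of the usage under default_wg instead.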

I'm on 3.3.0, but no queries have been running for a while, so it's strange that this memory is never released.

Take a look at the StarRocks monitoring to see whether some memory category looks abnormal.

OK, thank you.