BE crashes with std::bad_alloc during Stream Load

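【Details】
When loading data into StarRocks via Stream Load, the HTTP response came back with a non-200 status code and the load failed. On investigation, the StarRocks BE turned out to have crashed at that same point in time.

The data being loaded is in JSON format, roughly 8,000 records.
The StarRocks log output at the time was as follows: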

  • be.out
tcmalloc: large alloc 1246183424 bytes == 0x1d8844000 @  0x5492aef 0x572435c 0x1ecee38 0x56745d5
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
*** Aborted at 1670296784 (unix time) try "date -d @1670296784" if you are using GNU date ***
PC: @     0x7f9cab57b387 __GI_raise
*** SIGABRT (@0x2f3) received by PID 755 (TID 0x7f9be9357700) from PID 755; stack trace: ***
    @          0x3cb65d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f9cac030630 (unknown)
    @     0x7f9cab57b387 __GI_raise
    @     0x7f9cab57ca78 __GI_abort
    @          0x17ba04d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x5674086 __cxxabiv1::__terminate()
    @          0x56740f1 std::terminate()
    @          0x5674244 __cxa_throw
    @          0x17b9f54 _Znwm.cold
    @          0x1a1fc5d std::__cxx11::basic_string<>::_M_mutate()
    @          0x315e9b2 starrocks::serde::ProtobufChunkSerde::serialize_without_meta()
    @          0x315ea41 starrocks::serde::ProtobufChunkSerde::serialize()
    @          0x232ab6c starrocks::stream_load::NodeChannel::try_send_chunk_and_fetch_status()
    @          0x232aea4 _ZNSt17_Function_handlerIFvPN9starrocks11stream_load11NodeChannelEEZNS1_13OlapTableSink19_send_chunk_processEvEUlS3_E_E9_M_invokeERKSt9_Any_dataOS3_
    @          0x2325e43 starrocks::stream_load::OlapTableSink::_send_chunk_process()
    @          0x56ee410 execute_native_thread_routine
    @     0x7f9cac028ea5 start_thread
    @     0x7f9cab64396d __clone
    @                0x0 (unknown)
start time: Tue Dec 6 03:20:28 UTC 2022
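
To line the abort up with the restart, the two raw numbers in the be.out excerpt can be decoded. This is only a small sketch that uses the values quoted in the log above; the helper names `to_utc` and `to_gib` are made up for illustration.

```python
from datetime import datetime, timezone

# Values copied from the be.out excerpt above.
ABORT_UNIX_TIME = 1670296784      # "*** Aborted at 1670296784 (unix time)"
LARGE_ALLOC_BYTES = 1246183424    # "tcmalloc: large alloc 1246183424 bytes"


def to_utc(ts: int) -> str:
    """Format a unix timestamp as a UTC wall-clock string."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    print(to_utc(ABORT_UNIX_TIME))
    print(f"{to_gib(LARGE_ALLOC_BYTES):.2f} GiB")
```

The abort timestamp decodes to 2022-12-06 03:19:44 UTC, 44 seconds before the `start time` line, which is consistent with the BE being restarted shortly after the abort. The single allocation that tcmalloc reported is about 1.16 GiB.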

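【StarRocks version】2.2.5
【Cluster size】1 FE + 1 BE, co-located on the same machine
【Machine info】32G of memory, but because many services are deployed on this single server, be.conf contains the following limits: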

    tc_use_memory_min = 0
    tc_free_memory_rate = 0

    # 32G * 0.24 = 7.68G
    mem_limit = 24%
    # load mem = mem_limit * 60%, then would be around 32G * 0.24 * 0.6 = 4.6G
    load_process_max_memory_limit_percent = 60
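
The arithmetic in the comments above can be checked against the allocation size reported in be.out. The sketch below assumes that "32G" means 32 GiB and that the limits are plain percentages of that figure; how the BE actually derives its limits from physical memory is not modeled here, and the function name `budgets` is hypothetical.

```python
GIB = 2**30

# Assumptions taken from the post: 32G of RAM, mem_limit = 24%,
# load_process_max_memory_limit_percent = 60.
TOTAL_RAM_BYTES = 32 * GIB
MEM_LIMIT_FRACTION = 0.24
LOAD_LIMIT_PERCENT = 60

# From be.out: "tcmalloc: large alloc 1246183424 bytes"
LARGE_ALLOC_BYTES = 1246183424


def budgets(total_bytes, mem_fraction, load_percent):
    """Return (process limit, load limit) in bytes for the given settings."""
    process_limit = total_bytes * mem_fraction
    load_limit = process_limit * load_percent / 100
    return process_limit, load_limit


if __name__ == "__main__":
    process_limit, load_limit = budgets(
        TOTAL_RAM_BYTES, MEM_LIMIT_FRACTION, LOAD_LIMIT_PERCENT
    )
    print(f"process limit: {process_limit / GIB:.2f} GiB")
    print(f"load limit:    {load_limit / GIB:.2f} GiB")
    print(f"large alloc:   {LARGE_ALLOC_BYTES / load_limit:.0%} of the load limit")
```

Under these assumptions the process limit is 7.68 GiB and the load limit about 4.61 GiB, so the single 1.16 GiB allocation seen in the log would by itself take roughly a quarter of the load budget.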

【Import or export method】Stream Load

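The amount of data being loaded does not look large, so it should not take much memory. Are there any ideas or methods for investigating this further?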

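Hello, is this a test environment? We do not recommend deploying on machines with less than 32G of available memory. For testing, you can refer to the following.
Memory settings for small-memory machines (choose according to your actual situation):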

  • Disable ColumnPool: disable_column_pool=true
  • Reduce reserved memory: tc_use_memory_min=xxx
  • Reduce the memory cache: chunk_reserved_bytes_limit=xxx
  • Disable StoragePageCache: disable_storage_page_cache=true
  • Limit the number of Compaction threads: max_compaction_concurrency=xxx
  • Limit the number of versions a single Compaction task merges at once: max_cumulative_compaction_num_singleton_deltas=xxx

You can edit be.conf: remove the load_process_max_memory_limit_percent setting and add a new setting, enable_new_load_on_memory_limit_exceeded=true.

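@yingying Thank you for your reply!

  1. Regarding the memory configuration you mentioned: this is not a test environment. Because of the users we target and the nature of the product (the total data volume is not that large), we do not have separate spare resources to dedicate to a StarRocks deployment.
  2. What puzzles me here is that the Stream Load data volume was not large, and the crash is not reliably reproducible; it only happens occasionally. Does that suggest the memory is being consumed somewhere else in StarRocks rather than by the data itself?
  3. I could not find some of the parameters from your two replies in the be.conf documentation. Are they documented somewhere else? And what does each of them do?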

You can check the BE's detailed memory usage with:
curl -XGET -s http://BE_IP:BE_HTTP_PORT/metrics | grep -E "^starrocks_be_.*_mem_bytes|^starrocks_be_tcmalloc_bytes_in_use"
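
Since the crash is intermittent, it may help to collect these metrics from a script rather than by hand. The following is a minimal sketch of the same check in Python, assuming the `/metrics` endpoint returns plain text with one metric per line; the function names are made up for illustration, and the BE address and HTTP port have to be supplied by the caller.

```python
import re
from urllib.request import urlopen

# Same pattern as the grep expression in the reply above.
METRIC_PATTERN = re.compile(
    r"^starrocks_be_.*_mem_bytes|^starrocks_be_tcmalloc_bytes_in_use"
)


def filter_memory_metrics(metrics_text: str) -> list[str]:
    """Keep only the lines that the grep pattern would keep."""
    return [
        line for line in metrics_text.splitlines()
        if METRIC_PATTERN.search(line)
    ]


def fetch_memory_metrics(be_ip: str, be_http_port: int,
                         timeout: float = 5.0) -> list[str]:
    """Fetch the BE /metrics page over HTTP and return the memory lines."""
    url = f"http://{be_ip}:{be_http_port}/metrics"
    with urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return filter_memory_metrics(text)
```

Calling `fetch_memory_metrics` with your own BE address and HTTP port at a regular interval, and writing the result to a file with a timestamp, would give a record to compare against the time of the next crash.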