LOAD DATA INFILE and compaction failures

不知不觉 · 2024-02-26 14:10

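To help us locate your problem faster, please provide the following information. Thank you.
[Details] I am importing JSON files from an OSS directory into StarRocks. Each year has roughly 600 million rows, and I load one year at a time. The table is es.barragelist_mongodb. The load has failed many times recently, always during the loading phase.
[Background] To improve query performance, I partition by year with 1024 buckets per partition, and plan to keep the most recent 7 years of data.
To speed up loading, I ran: update information_schema.be_configs set value = 16 where name like 'flush_thread_num_per_store'.
Load throughput improved noticeably, but the job always failed near the end. The cause turned out to be the memory limit, so I added a node and raised the load memory limit to 50%. Since there are very few queries at the moment and only this one load job, this should in theory be safe.
After the change the job still hit the memory limit. I found that as load_limit goes up, the load uses correspondingly more memory. By then the barragelist_mongodb table already contained the 2018 and 2019 partitions, so compaction runs continuously; each run fails and is then retried.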

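The load and the compaction together eventually use more memory than the limit allows, so the job fails. Once the load job has failed, the memory used by compaction also drops to a very low level.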
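The BE settings mentioned above can be read back through the same information_schema.be_configs view that the update statement writes to. This is only a minimal sketch: the post does not name the load memory limit parameter, so the second query uses a guessed name pattern rather than an exact name.

```sql
-- Current value of the flush thread setting on every BE.
SELECT *
FROM information_schema.be_configs
WHERE name = 'flush_thread_num_per_store';

-- Assumption: the load memory limit is a BE setting whose name contains
-- both "load" and "mem"; the post does not give the exact parameter name.
SELECT *
FROM information_schema.be_configs
WHERE name LIKE '%load%mem%';
```

Comparing the values across both BEs shows whether the newly added node picked up the same settings as the original one.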

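[Business impact]
[Shared-data (storage-compute separation)?]
[StarRocks version] 3.2.3
[Cluster size] 1 FE + 2 BE
[Machine specs] 16 cores, 128 GB
[Table type] Primary Key table
[Load or export method] Loading JSON files from OSS with load data infile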

Load statement:
LOAD LABEL barragelist_mongodb_from_oss_all_2020
(
    DATA INFILE("s3://enlightent-backup/es-cluster/barragelist-mongodb/barragelist-mongodb/2020*/*")
    INTO TABLE barragelist_mongodb
    FORMAT AS "JSON"
    (updateTime, channel, name, index, @timestamp, mongoId, userId, channelType, startTime, mDanmuId, content, upCount, createTime, type)
    SET (updateday = from_unixtime(createTime/1000, '%Y-%m-%d'))
)
WITH BROKER
(
    "aws.s3.enable_ssl" = "false",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.region" = "cn-beijing",
    "aws.s3.endpoint" = "https://oss-cn-beijing-internal.aliyuncs.com",
    "aws.s3.access_key" = "*********xxxx",
    "aws.s3.secret_key" = "xxxxxxuDJc7l"
)
PROPERTIES
(
    "timeout" = "100000",
    "max_filter_ratio" = "0.1"
);

type:LOAD_RUN_FAIL; msg:Memory of process exceed limit. Pipeline Backend: 172.16.100.42, fragment: 373682ca-25ce-408b-86ed-e3ccf8f178f7 Used: 56931348088, Limit: 56696658001. Mem usage has exceed the limit of BE

type:LOAD_RUN_FAIL; msg:Memory of process exceed limit. try consume:1572864 Used: 40131672848, Limit: 50728566804. Mem usage has exceed the limit of BE
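The byte counts in the two messages are easier to compare with the 128 GB machines once converted to GiB. The sketch below only does that arithmetic on the numbers quoted above: the first message works out to roughly 53.0 GiB used against a limit of about 52.8 GiB, the second to roughly 37.4 GiB used against a limit of about 47.2 GiB. In the second message the reported usage is therefore below the reported limit.

```sql
-- 1 GiB = 1073741824 bytes. Values are copied from the error messages above.
SELECT 56931348088 / 1073741824 AS used_gib,
       56696658001 / 1073741824 AS limit_gib;  -- first message

SELECT 40131672848 / 1073741824 AS used_gib,
       50728566804 / 1073741824 AS limit_gib;  -- second message
```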

不知不觉 · 2024-02-26 14:18

be.warning:
W0226 22:08:51.840982 14655 lake_service.cpp:214] Fail to submit publish version task: Timeout: acquire semaphore reached deadline=1708956531837. tablet_id=212513 txn_ids=774848

W0226 16:39:32.006901 30488 mem_hook.cpp:249] large memory alloc, query_id:373682ca-25ce-408b-86ed-e3ccf8f178f5 instance: 373682ca-25ce-408b-86ed-e3ccf8f178f7 acquire:1185655588 bytes, stack:
@ 0x2d93c8d malloc
@ 0x8ae1905 operator new()
@ 0x8ae19d9 operator new
@ 0x7a06fb6 simdjson::haswell::dom_parser_implementation::set_capacity()
@ 0x79ef6e0 simdjson::haswell::implementation::create_dom_parser_implementation()
@ 0x367ed63 simdjson::fallback::ondemand::document_stream::start()
@ 0x367c330 starrocks::JsonDocumentStreamParser::parse()
@ 0x366f6f6 starrocks::JsonReader::_read_and_parse_json()
@ 0x3673c3e starrocks::JsonScanner::_open_next_reader()
@ 0x36753a2 starrocks::JsonScanner::get_next()
@ 0x5ae2f91 starrocks::connector::FileDataSource::get_next()
@ 0x37c44dd starrocks::pipeline::ConnectorChunkSource::_read_chunk()
@ 0x3ae00ff starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
@ 0x37b7089 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
@ 0x38c26e1 starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x2e493aa starrocks::ThreadPool::dispatch_thread()
@ 0x2e43e0a starrocks::supervise_thread()
@ 0x7f3793d07ea5 start_thread
@ 0x7f3793108b0d __clone

fe.warn.log:
2024-02-26 22:13:22,912 ERROR (lake-publish-task-66395|132140) [PublishVersionDaemon.publishPartition():832] Fail to publish partition 290607 of txn 775389: A error occurred: errorCode=62 errorMessage:method request time out, please check 'onceTalkTimeout' property. current value is:60000(MILLISECONDS) correlationId:1039788 timeout with bound channel =>[id: 0x89ac44ce, L:/172.16.100.40:39301 - R:/172.16.100.40:8060], host: 172.16.100.40

不知不觉 · 2024-02-26 14:21

The source contains some dirty data. Because load data infile does not support filtering with a WHERE condition, the load creates many junk partitions.

Finally I tried: update information_schema.be_configs set value = 4 where name like 'flush_thread_num_per_store'.
The load now seems to get through: at 99% progress the state changed to PREPARED.
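The state and progress described here can be followed with SHOW LOAD. A small sketch, assuming the job lives in the es database (taken from the table name es.barragelist_mongodb) and uses the label from the load statement above:

```sql
-- Assumption: the load job was submitted in database es.
USE es;

-- State, Progress and ErrorMsg in the output show where the job stands
-- and, on failure, the message reported for it.
SHOW LOAD WHERE LABEL = "barragelist_mongodb_from_oss_all_2020";
```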

不知不觉 · 2024-02-26 14:25

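This is what compaction looks like: it keeps failing and keeps retrying.
There were nearly 280 partitions in total.
I have already deleted the junk partitions with a program, but compaction still fails.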
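For anyone reproducing this, the remaining partitions and the compaction tasks can be listed from the SQL client. This is a hedged sketch: the second statement assumes a shared-data cluster, which the lake_service.cpp and lake-publish-task entries in the logs suggest but the post does not confirm, and it assumes the compaction proc path is available in this version.

```sql
-- Partitions left on the table after the junk partitions were dropped.
SHOW PARTITIONS FROM es.barragelist_mongodb;

-- Assumption: shared-data cluster. Lists compaction tasks together with
-- their timing and any error reported for them.
SHOW PROC '/compactions';
```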