Truncating partition data is extremely slow or never completes

To help us locate your problem faster, please provide the following information. Thanks.
[Details]
The previous StarRocks version was 2.5.8. Because the FE failed to start, we had no choice but to upgrade to 3.1.11. After the upgrade, clearing partition data with truncate table table_name partition(p20240527, p20240526) is extremely slow, or simply never succeeds. Clearing partition data with DELETE still works normally, and on 2.5.8 truncating partitions was also normal.
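For reference, the shapes of the two statements involved (the table name and the DELETE predicate below are placeholders, not the exact production statements):

-- extremely slow or never finishes since the upgrade:
truncate table table_name partition(p20240527, p20240526);
-- still works normally:
delete from table_name partition p20240527 where dateid = '2024-05-27';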
[Background]
After upgrading to 3.1.11, two parameters were changed for syncing data from Hive into StarRocks: "tablet_create_timeout_second" = "320" and "routine_load_unstable_threshold_second" = "5400".
All other parameters are identical to the 2.5.8 setup; nothing else was adjusted.
[Business impact]
[Shared-data (separate storage and compute)?] No
[StarRocks version] 3.1.11
[Cluster size] 4 FE (3 followers + 1 observer) + 4 BE (FE and BE co-located)
[Machine specs] 24C / 64G / 10 GbE
[Contact] QQ email: 564685964@qq.com
[Attachment] The fe.conf configuration file is as follows:


#####################################################################
# The uppercase properties are read and exported by bin/start_fe.sh.
# To see all Frontend configurations,
# see fe/src/org/apache/doris/common/Config.java
#####################################################################

# the output dir of stderr and stdout
LOG_DIR = ${DORIS_HOME}/log

DATE = "$(date +%Y%m%d-%H%M%S)"
JAVA_OPTS="-Xmx25600m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$DATE"

# For jdk 9+, this JAVA_OPTS will be used as default JVM options
JAVA_OPTS_FOR_JDK_9="-Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$DORIS_HOME/log/fe.gc.log.$DATE:time"

# the lowercase properties are read by main program.

# INFO, WARN, ERROR, FATAL
sys_log_level = INFO

# store metadata, create it if it is not exist.
# Default value is ${DORIS_HOME}/doris-meta
# meta_dir = ${DORIS_HOME}/doris-meta
meta_dir = /datacenter/fe/meta

http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010
mysql_service_nio_enabled = true

# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.
# priority_networks = 10.10.10.0/24;192.168.0.0/16
priority_networks = 172.16.31.0/24

# Advanced configurations
log_roll_size_mb = 1024
sys_log_dir = /datacenter/fe/log
sys_log_roll_num = 10
sys_log_verbose_modules =
audit_log_dir = /datacenter/fe/log
audit_log_modules = slow_query, query
audit_log_roll_num = 10
meta_delay_toleration_second = 10
qe_max_connection = 1024
max_conn_per_user = 100
qe_query_timeout_second = 300
qe_slow_log_ms = 5000
max_routine_load_task_num_per_be = 100
max_routine_load_job_num = 500
enable_statistic_collect = true
hive_meta_cache_refresh_interval_s = 3600
ignore_unknown_id = true
ignore_unknown_log_id = true
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 20000
hms_events_batch_size_per_rpc = 1000
enable_hms_parallel_process_evens = true
hms_process_events_parallel_num = 10
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 5000
hms_events_batch_size_per_rpc = 500
enable_hms_parallel_process_evens = true
hms_process_events_parallel_num = 4
tablet_create_timeout_second = 320
routine_load_unstable_threshold_second = 5400
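Note that the HMS-related keys appear twice above with different values; when the FE loads fe.conf, the later assignment should win, but it may be worth confirming what the running FE actually uses. A quick check from a MySQL client connected to the FE (a sketch, not the only way):

ADMIN SHOW FRONTEND CONFIG LIKE 'hms_events_polling_interval_ms';
ADMIN SHOW FRONTEND CONFIG LIKE 'tablet_create_timeout_second';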

How many buckets are configured for the two partitions of this table?

Right now truncating some partitions is slow for all partitioned tables, not just a few. The partitioned tables almost all use 13 buckets.

Here is the CREATE TABLE statement for one of the tables:

CREATE TABLE dws_fact_trade_order_ymd (
  dateid date NULL COMMENT "doris partition column",
  o_order_no varchar(65533) NULL COMMENT "original order number",
  v_goods_profit_share_amount bigint(20) NULL DEFAULT "-1" COMMENT "apportioned gross profit of the goods",
  v_goods_profit_month_return_amount bigint(20) NULL DEFAULT "-1" COMMENT "supplier monthly-rebate gross profit of the goods",
  create_time varchar(65533) NULL COMMENT "record creation time",
  update_time varchar(65533) NULL COMMENT "record update time"
) ENGINE=OLAP
UNIQUE KEY(dateid, o_order_no)
COMMENT "OLAP"
PARTITION BY RANGE(dateid)
(PARTITION p20240530 VALUES [("2024-05-30"), ("2024-05-31")),
 PARTITION p20240531 VALUES [("2024-05-31"), ("2024-06-01")),
 PARTITION p20240601 VALUES [("2024-06-01"), ("2024-06-02")))
DISTRIBUTED BY HASH(o_order_no) BUCKETS 13
PROPERTIES (
  "replication_num" = "3",
  "bloom_filter_columns" = "dateid, o_order_no",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.time_zone" = "Asia/Shanghai",
  "dynamic_partition.start" = "-700",
  "dynamic_partition.end" = "3",
  "dynamic_partition.prefix" = "p",
  "dynamic_partition.buckets" = "13",
  "dynamic_partition.history_partition_num" = "0",
  "in_memory" = "false",
  "enable_persistent_index" = "false",
  "replicated_storage" = "false",
  "compression" = "LZ4"
);
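For scale: with BUCKETS 13 and "replication_num" = "3", truncating 3 partitions of a table like this rebuilds 3 partitions × 13 buckets × 3 replicas = 117 tablet replicas, which matches the "117 replicas" figure in the fe.log excerpt further below.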

1. truncate table table_name partition(p20240424): sometimes fast (around 5 seconds), sometimes slow (around 3 minutes).
2. truncate table table_name partition(p20240405,p20240406,p20240407,p20240408): around 2 seconds when fast; when slow, it still has not finished after more than 10 minutes.

Overall it feels unstable, alternating between fast and slow.
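A minimal timing sketch from the shell to put numbers on that variance (host, port, and credentials are hypothetical; substitute the real database, table, and partition names):

for i in 1 2 3; do
  # wall-clock seconds for one truncate issued through the MySQL protocol
  /usr/bin/time -f '%es' mysql -h 127.0.0.1 -P 9030 -uroot \
    -e 'truncate table db.table_name partition(p20240424)'
  sleep 60   # wait a minute between runs
done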

Please post the leader fe.log covering the half hour before and after a slow truncate. If you can reproduce it, also take a jstack of the FE while the truncate is slow (jstack -l $fe_pid > fe_jstack.log) and post fe_jstack.log.

Also check whether many tasks are queued in the BE log while the truncate is slow: grep -E 'task_count_in_queue=[2-9][0-9]{3,}' be.INFO | tail

Case: truncate table dws.dws_fact_trade_suborder_detail_ymd_test partition(p20240315,p20240316,p20240317)
Execution start time: 2024-05-31 14:10:59.929

1. fe_jstack log:
fe_jstack.log (527.8 KB)

2. fe.log, searched with: tail -200f fe.log | grep dws_fact_trade_suborder_detail_ymd_test

2024-05-31 14:10:59.929+08:00 INFO (starrocks-mysql-nio-pool-4664|987852) [LocalMetastore.buildPartitions():1727] start to build 3 partitions sequentially for table dws.dws_fact_trade_suborder_detail_ymd_test with 117 replicas
2024-05-31 14:17:18.991+08:00 INFO (starrocks-mysql-nio-pool-4671|988220) [LocalMetastore.truncateTable():4545] finished to truncate table dws.dws_fact_trade_suborder_detail_ymd_test, partitions: PARTITIONS (p20240315, p20240316, p20240317)
2024-05-31 14:27:05.657+08:00 WARN (starrocks-mysql-nio-pool-4664|987852) [StmtExecutor.handleDdlStmt():1473] DDL statement (/* ApplicationName=DBeaver 21.0.3 - SQLEditor <Script-214.sql> */ truncate table dws.dws_fact_trade_suborder_detail_ymd_test partition(p20240315,p20240316,p20240317)) process failed.
The log already shows the truncate as finished, yet the client was still executing (as of 2024-05-31 14:22:59.929).

As of 2024-05-31 14:27:34 the statement finally failed. The exact error:
SQL Error [1064] [42000]: Unexpected exception: Partition [p20240315] is changed during truncating table, please retry
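Given the "please retry" hint in that error, one hypothetical stopgap is a per-partition retry loop (connection details are placeholders), truncating one partition at a time until each succeeds:

for p in p20240315 p20240316 p20240317; do
  # keep retrying this partition until the truncate goes through
  until mysql -h 127.0.0.1 -P 9030 -uroot \
        -e "truncate table dws.dws_fact_trade_suborder_detail_ymd_test partition($p)"; do
    echo "retrying $p"; sleep 5
  done
done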

Detailed fe.log for the half hour before and after:
output.log (69.6 MB)

3. be.INFO, checked with tail -500f be.INFO | grep -E 'task_count_in_queue=[2-9][0-9]{3,}':
no matches found.

One more observation: the first truncate of 3 partitions on a given table takes about 2 seconds; running the same truncate on the same 3 partitions one minute later can take 3 minutes or longer. Repeating this cycle, it is sometimes fast, sometimes slow, sometimes fails outright, and sometimes never completes at all, so we have to kill the session manually.

This is the same case whose logs are posted above.

The StarRocks cluster was upgraded from 2.5.8 to 3.1.11 and has not been restarted since the upgrade. Is a restart likely to resolve an issue like this? Since this is a production environment, we have not risked a restart.

Hello, what is the progress on this issue? Are there any suggestions for resolving it?