存算分离的CN节点 重启后 Table creation timed out

【详述】存算分离的CN节点 重启后 执行sql select into 导入数据 经常会报 Table creation timed out
【背景】 已经按报错信息 增加了 tablet_create_timeout_second 还是不行
【业务影响】 数据无法导入
【是否存算分离】 是
【StarRocks版本】例如:3.2.11
【集群规模】例如:3fe(1 follower+2observer)+3be
【机器信息】fe 8C32G CN 16C128G
【联系方式】972592651@qq.com

java.sql.SQLSyntaxErrorException: Table creation timed out.
You can increase the timeout by increasing the config “tablet_create_timeout_second” and try again.
To increase the config “tablet_create_timeout_second” (currently 20), run the following command:

admin set frontend config("tablet_create_timeout_second"="40")

or add the following configuration to the fe.conf file and restart the process:

tablet_create_timeout_second=40
  1. 建表语句发一下,字段可省略
  2. 导入sql发一下,一次导入多少个分区的数据

我们是一次性的 全量导入
SET query_timeout=3600;
SET enable_spill = “true”;
SET spill_mode = “force” ;
drop table if exists tmp_store_order_amt_item_stat_bi ;
create table if not exists tmp_store_order_amt_item_stat_bi as
select t1.posno
,t1.flowno
,t1.total_realamt
,coalesce(max_payment_type,0) as payment_type
,coalesce(t2.xj,0) as xj
,coalesce(t2.jf,0) as jf
,coalesce(t2.lkl,0) as lkl
,coalesce(t2.yhk,0) as yhk
,coalesce(t2.smzf,0) as smzf
,coalesce(t2.scq,0) as scq
,coalesce(t2.gsq,0) as gsq
,coalesce(t2.zxzf,0) as zxzf
,coalesce(t2.dft,0) as dft
,coalesce(t2.dlyq,0) as dlyq
,coalesce(t2.hdq,0) as hdq
,coalesce(t2.coupon,0) as coupon
,coalesce(t2.alipay_card,0) as alipay_card
from (
select posno,flowno,sum(realamt) as total_realamt
from cn_ods.ods_hd40_buy2s
group by posno,flowno
)t1
left outer join
(
select posno,flowno,
max(if(rn = 1,payment_type,0)) as max_payment_type,
sum(case when payment_type in (1) then amt else 0 end ) as xj ,
sum(case when payment_type in (2) then amt else 0 end ) as jf ,
sum(case when payment_type in (3) then amt else 0 end ) as lkl ,
sum(case when payment_type in (4) then amt else 0 end ) as yhk ,
sum(case when payment_type in (5) then amt else 0 end ) as smzf ,
sum(case when payment_type in (6) then amt else 0 end ) as scq ,
sum(case when payment_type in (7) then amt else 0 end ) as gsq ,
sum(case when payment_type in (8) then amt else 0 end ) as zxzf ,
sum(case when payment_type in (9) then amt else 0 end ) as dft ,
sum(case when payment_type in (10) then amt else 0 end ) as dlyq ,
sum(case when payment_type in (11) then amt else 0 end ) as hdq,
sum(case when payment_type in (9999) then amt else 0 end ) as coupon,
sum(case when payment_type in (15) then amt else 0 end ) as alipay_card

from(
select *,row_number() over(partition by posno,flowno order by amt desc) as rn
from (
select posno,flowno,if(currency in (1,-1),1,currency) as payment_type,sum(amount) as amt
from cn_ods.ods_hd40_buy11s
group by posno,flowno,if(currency in (1,-1),1,currency)) a
) b

group by posno,flowno
)t2 on t1.posno=t2.posno and t1.flowno=t2.flowno
;

目前我们暂时的解决了 是需要重启fe的master节点就不会有这个问题了 想问下 是只能这样解决吗

正常应该不用重启fe,看下这个任务的表 create时 需要创建多少个 tablet,与此同时集群是否有其他task也在占用线程创建tablet
可以用tablename 或者 此次写入任务的事务txnid 等关键词到leader fe.log 中搜一下,看看有没有其他异常日志

还有个问题 上边是我们的导入sql 从几个表查出来 总共两个多亿的数据 然后插入另一张表 目前用 3台128G的服务器 执行一个小时左右就超时 失败了 也没有报错日志 也不知道卡在哪了 监控服务器的内存如下 能看出哪的问题吗 是我们的load调的太大了吗

有没有好的优化建议啊 我们调整了 分区 以月为单位分区 应该也不是 tablet太多的原因吧

执行时间超时导致的吧,把 query_timeout 再调大点,可以打profile看看https://docs.starrocks.io/zh/docs/administration/query_profile_text_based_analysis/