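【Details】The whole cluster suddenly became unable to create new partitions. The offline import job and the real-time tasks report essentially the same error: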
create partitions failed: Table creation timed out.
You can increase the timeout by increasing the config "tablet_create_timeout_second" and try again.
To increase the config "tablet_create_timeout_second" (currently 60), run the following command:
admin set frontend config("tablet_create_timeout_second"="120")
or add the following configuration to the fe.conf file and restart the process:tablet_create_timeout_second=120
at com.starrocks.qe.StmtExecutor.handleDMLStmt(StmtExecutor.java:2241)
at com.starrocks.qe.StmtExecutor.handleDMLStmtWithProfile(StmtExecutor.java:1832)
at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:689)
at com.starrocks.qe.ConnectProcessor.proxyExecute(ConnectProcessor.java:775)
at com.starrocks.service.FrontendServiceImpl.forward(FrontendServiceImpl.java:1314)
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4276)
at com.starrocks.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4256)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
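【Background】What has been tried?
1. Following the error message, ran admin set frontend config("tablet_create_timeout_second"="120"), and also tried even longer timeouts; it made no difference.
2. Tried to create a new table as a test; the table could not be created and failed with the same timeout.
3. Tried a manual INSERT INTO targeting a new partition to force its creation; it failed with the same timeout error.
4. Restarted all FEs; no effect.
5. Dropped all materialized views and restarted all BEs. The cluster worked normally for roughly ten-odd minutes, then the same timeout error returned.
6. Upgraded the cluster from 3.2.11 to 3.2.12 and restarted the BEs again. It again worked normally for ten-odd minutes, then hit the same error.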
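On inspection, every time the BEs were restarted, the metadata read/write rate on one BE machine spiked sharply. I stopped the BE service on that machine; since almost all tables have three replicas, stopping it had no impact.
With that machine stopped, everything went back to normal: new partitions could be created and the real-time tasks ran normally.
After it had been stopped for roughly ten-odd minutes, I started that BE service again, and everything has remained normal.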
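
For anyone hitting the same symptom, the checks described above could be sketched with the statements below. This is only a sketch under assumptions: it assumes a MySQL client connected to an FE, and the post does not say which statements were actually run. The disk read/write rate on the suspect BE was observed at the machine level and is not covered here.

```sql
-- Confirm that the timeout change from step 1 actually took effect on the FE
-- this session is connected to.
ADMIN SHOW FRONTEND CONFIG LIKE "tablet_create_timeout_second";

-- List all BEs with their liveness and tablet counts, to spot a node that
-- is down or behaves differently from the others.
SHOW BACKENDS;

-- Cluster-wide tablet health summary, useful for checking that replicas are
-- still healthy while one BE is stopped.
SHOW PROC '/statistic';
```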
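【Business impact】
Almost all workloads are affected whenever a new partition has to be created.
【Shared-data (storage-compute separated) deployment?】
No
【StarRocks version】
The problem occurs on 3.2.11 and also on 3.2.12 after the upgrade.
【Cluster size】3 FE + 12 BE
【Machine info】
FE: 2C + 14G
BE: 32C + 64G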