[Database.logTryLockFailureEvent():148] try db lock failed. Afterwards cluster performance degrades and queries cannot run

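To help us locate your problem faster, please provide the following information. Thank you.
【Details】[Database.logTryLockFailureEvent():148] try db lock failed.
【Background】Routine load import fails
【Business impact】
【StarRocks version】3.5.6
【Cluster size】1 FE + 5 BE
【Machine info】CPU virtual cores/memory/NIC, e.g. 48C/64G/10GbE
【Contact】Group 4
【Attachments】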
2023-08-23 10:52:44,698 WARN (thrift-server-pool-188504|204581) [FrontendServiceImpl.loadTxnCommitImpl():1027] txn 9100872 publish timeout errors: {tablet:19929 quorum:2 version:203259 #replica:3 err: 10.21.49.163:203258 10.21.49.217:2
2023-08-23 10:52:44,707 WARN (thrift-server-pool-188505|204582) [FrontendServiceImpl.loadTxnCommitImpl():1027] txn 9100875 publish timeout errors: {tablet:21879 quorum:2 version:218000 #replica:3 err: 10.21.49.186:217999 10.21.49.163:2
2023-08-23 10:52:44,724 WARN (thrift-server-pool-188506|204583) [FrontendServiceImpl.loadTxnCommitImpl():1027] txn 9100873 publish timeout errors: {tablet:19539 quorum:2 version:203566 #replica:3 err: 10.21.49.144:203565 10.21.49.217:2
2023-08-23 10:52:44,747 WARN (thrift-server-pool-188507|204584) [FrontendServiceImpl.loadTxnCommitImpl():1027] txn 9100876 publish timeout errors: {tablet:20319 quorum:2 version:202481 #replica:3 err: 10.21.49.127:202480 10.21.49.144:2
2023-08-23 10:52:44,757 WARN (thrift-server-pool-188508|204585) [Database.logTryLockFailureEvent():148] try db lock failed. type: writeLock, current owner id: 204584, owner name: thrift-server-pool-188507, owner stack: dump thread: thr
java.util.Spliterator.getExactSizeIfKnown(Spliterator.java:408)
java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
com.starrocks.load.routineload.RoutineLoadJob.afterCommitted(RoutineLoadJob.java:801)
com.starrocks.transaction.TransactionState.afterStateTransform(TransactionState.java:495)
com.starrocks.transaction.DatabaseTransactionMgr.commitTransaction(DatabaseTransactionMgr.java:468)
com.starrocks.transaction.GlobalTransactionMgr.commitTransaction(GlobalTransactionMgr.java:360)
com.starrocks.transaction.GlobalTransactionMgr.commitAndPublishTransaction(GlobalTransactionMgr.java:438)
com.starrocks.service.FrontendServiceImpl.loadTxnCommitImpl(FrontendServiceImpl.java:1017)
com.starrocks.service.FrontendServiceImpl.loadTxnCommit(FrontendServiceImpl.java:976)
com.starrocks.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:2656)
com.starrocks.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:2636)
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-08-23 10:52:44,757 WARN (thrift-server-pool-188508|204585) [FrontendServiceImpl.loadTxnCommit():978] failed to commit txn_id: 9100874: get database write lock timeout, database=o_biz, timeoutMillis=15000
2023-08-23 10:52:44,757 WARN (thrift-server-pool-179135|194609) [Database.logSlowLockEventIfNeeded():141] slow db lock. type: readLock, db id: 10754, db name: o_biz, wait time: 15001ms, former owner id: 204584, owner name: thrift-serve
java.lang.Throwable.fillInStackTrace(Native Method)
java.lang.Throwable.fillInStackTrace(Throwable.java:784)
java.lang.Throwable.<init>(Throwable.java:251)
org.apache.logging.log4j.util.StackLocator.calcLocation(StackLocator.java:196)
org.apache.logging.log4j.util.StackLocatorUtil.calcLocation(StackLocatorUtil.java:99)
org.apache.logging.log4j.spi.AbstractLogger.getLocation(AbstractLogger.java:2216)
org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2034)
org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1899)
org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1444)
com.starrocks.transaction.DatabaseTransactionMgr.commitTransaction(DatabaseTransactionMgr.java:479)
com.starrocks.transaction.GlobalTransactionMgr.commitTransaction(GlobalTransactionMgr.java:360)
com.starrocks.transaction.GlobalTransactionMgr.commitAndPublishTransaction(GlobalTransactionMgr.java:438)
com.starrocks.service.FrontendServiceImpl.loadTxnCommitImpl(FrontendServiceImpl.java:1017)
com.starrocks.service.FrontendServiceImpl.loadTxnCommit(FrontendServiceImpl.java:976)
com.starrocks.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:2656)
com.starrocks.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:2636)
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)

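Acquiring the lock failed. Has the number of load jobs increased?

Yes, there are quite a lot of load jobs.

Hello, how was this resolved in the end?

Hello, how should this problem be solved? When there are too many load jobs, would switching from Kafka routine load to stream load ease the situation?

Hello, I'd like to ask how this problem should be solved. The current cluster version is 3.1.6, and the problem appears when there are too many load jobs.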

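This is usually caused by slow writes or high concurrency. On a shared-nothing cluster, you can adjust the following parameters to improve write efficiency, although this increases the load on the disks.

flush_thread_num_per_store=8 // Number of flush threads per disk. Set it higher when there are few disks and lower when there are many. As a rule of thumb, flush_thread_num_per_store * number of disks < be_cpu_core_num / 2

number_tablet_writer_threads = 16 // Default is 16, range [16-48]. Usually set to about 1/3 of the CPU core count.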
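
The two rules of thumb above can be turned into a quick sanity check before editing the BE configuration. The sketch below is only an illustration: the function name `suggest_be_write_threads` and its inputs are hypothetical, and the formulas are just the heuristics quoted in this reply, not something StarRocks computes itself.

```python
def suggest_be_write_threads(cpu_cores: int, disks: int) -> dict:
    """Apply the rules of thumb quoted above (heuristics, not an official formula)."""
    if cpu_cores <= 0 or disks <= 0:
        raise ValueError("cpu_cores and disks must be positive")

    # Rule: flush_thread_num_per_store * disks < cpu_cores / 2 (strict).
    # The largest integer f with 2 * f * disks < cpu_cores is:
    max_flush = (cpu_cores - 1) // (2 * disks)
    # With very few cores the rule cannot be met; fall back to one thread.
    max_flush = max(1, max_flush)

    # Rule: number_tablet_writer_threads is about cores / 3, kept within [16, 48].
    writer_threads = min(48, max(16, cpu_cores // 3))

    return {
        "flush_thread_num_per_store": max_flush,
        "number_tablet_writer_threads": writer_threads,
    }


if __name__ == "__main__":
    # Example: a 48-core BE with 4 data disks (assumed hardware).
    print(suggest_be_write_threads(cpu_cores=48, disks=4))
```

The flush value returned is an upper bound under the quoted rule, so a smaller setting is also consistent with it. Both values trade disk load for write throughput, so it is worth watching disk utilization after changing them.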

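Routine load tuning

  • Lower the load QPS; try to keep the cluster-wide load QPS below 10

    • Calculation: cluster-wide routine_load_task_num / routine_load_task_consume_second
  • Increase the amount of data per load transaction; a single Routine Load Task should load more than 1 GB

    • This requires adjusting both max_routine_load_batch_size and routine_load_task_timeout_second
  • Try to keep the number of concurrent load tasks on a single BE, routine_load_thread_pool_size, below be_core_num / 2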
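
The checklist above can be expressed as a small calculation. This is a rough sketch under stated assumptions: `check_routine_load` and its arguments are hypothetical names, `total_tasks` stands for the cluster-wide number of routine load tasks, and the thresholds are simply the ones suggested in this reply.

```python
def check_routine_load(total_tasks: int, consume_second: float,
                       be_cores: int, thread_pool_size: int):
    """Return (load_qps, warnings) using the thresholds suggested above."""
    if consume_second <= 0:
        raise ValueError("consume_second must be positive")

    # Load QPS = cluster-wide task count / routine_load_task_consume_second
    qps = total_tasks / consume_second

    warnings = []
    if qps >= 10:
        warnings.append(f"load QPS {qps:.1f} is not below 10")
    if thread_pool_size >= be_cores / 2:
        warnings.append(
            f"routine_load_thread_pool_size {thread_pool_size} "
            f"is not below be_core_num / 2 ({be_cores / 2:g})"
        )
    return qps, warnings


if __name__ == "__main__":
    # Example numbers are made up for illustration.
    qps, warnings = check_routine_load(
        total_tasks=300, consume_second=15, be_cores=48, thread_pool_size=10
    )
    print(qps, warnings)
```

With these made-up numbers the QPS check fails, which points toward either fewer tasks or a longer consume interval so that each transaction carries more data.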