【死锁】集群突然发生死锁问题导致查询错误

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】集群突然发送死锁,导致查询数据失败,streamload 失败。
【背景】无特别操作。
【业务影响】
【是否存算分离】否
【StarRocks版本】2.5.10
【集群规模】例如:3fe+6be
【机器信息】CPU虚拟核/内存/网卡,例如:48C/64G/万兆
【联系方式】为了在解决问题过程中能及时联系到您获取一些日志信息,请补充下您的联系方式,例如:社区群4-小李或者邮箱,谢谢
【附件】

FE master节点线程数上涨:
image
重启master节点后恢复。
FE日志出现大量:[FrontendServiceImpl.streamLoadPut():1215] failed to get stream load plan: get database read lock timeout, database=xxx
以及streamload tnx 的失败。导致flink 无法正常写入数据。

在出现read timeout之后的堆栈信息:
2023-12-19 17:57:56,158 WARN (thrift-server-pool-51533244|52826762) [TIOStreamTransport.close():110] Error closing output stream.
java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) ~[?:?]
at java.net.SocketOutputStream.write(SocketOutputStream.java:150) ~[?:?]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81) ~[?:?]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142) ~[?:?]
at java.io.FilterOutputStream.close(FilterOutputStream.java:182) ~[?:?]
at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:108) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TSocket.close(TSocket.java:235) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:326) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
2023-12-19 17:57:56,961 WARN (nioEventLoopGroup-6-9|128) [TransactionLoadAction.executeWithoutPassword():101] com.starrocks.common.UserException: publish timeout: 20000
at com.starrocks.transaction.GlobalTransactionMgr.commitPreparedTransaction(GlobalTransactionMgr.java:431)
at com.starrocks.http.rest.TransactionLoadAction.executeTransaction(TransactionLoadAction.java:153)
at com.starrocks.http.rest.TransactionLoadAction.executeWithoutPassword(TransactionLoadAction.java:90)
at com.starrocks.http.rest.RestBaseAction.execute(RestBaseAction.java:85)
at com.starrocks.http.rest.RestBaseAction.handleRequest(RestBaseAction.java:60)
at com.starrocks.http.HttpServerHandler.channelRead(HttpServerHandler.java:80)

第一次出现read lock 之后打印的异常信息。
fe.warn.log (2).20231219-1 (122.5 KB)

升级到2.5最新小版本吧,2.5.13修复过一个stream load的死锁问题,https://github.com/StarRocks/starrocks/pull/31746

我们看到2.3版本的,也有过修复的记录。
https://github.com/StarRocks/starrocks/pull/18992
https://github.com/StarRocks/starrocks/pull/16748
这两个相关吗?
我们如何在现在的版本下复现这个问题?我们担心升级之后,还是这样的问题。

你好,我们最近刚刚升级了2.5.17版本,其中加入了 LockChecker.java。然后又发生了一次读的锁。其中关键打印信息:
2024-03-14 09:26:04,267 INFO (DeadlockChecker|38) [LockChecker.checkSlowLocks():109] slow db locks: [{“lockState”:“readLocked”,“slowReadLockCount”:1,“dumpThreads”:“lockHoldTime: 28842 ms;dump thread: starrocks-taskrun-pool-3225, id: 1579979\n java.base@11.0.21/jdk.internal.misc.Unsafe.park(Native Method)\n java.base@11.0.21/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)\n java.base@11.0.21/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)\n java.base@11.0.21/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:1009)\n java.base@11.0.21/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1324)\n java.base@11.0.21/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)\n app//com.starrocks.common.util.QueryableReentrantReadWriteLock.sharedLock(QueryableReentrantReadWriteLock.java:30)\n app//com.starrocks.catalog.Database.readLock(Database.java:156)\n app//com.starrocks.sql.StatementPlanner.lock(StatementPlanner.java:239)\n app//com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.prepareRefreshPlan(PartitionBasedMvRefreshProcessor.java:218)\n app//com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.doMvRefresh(PartitionBasedMvRefreshProcessor.java:183)\n app//com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.processTaskRun(PartitionBasedMvRefreshProcessor.java:128)\n app//com.starrocks.scheduler.TaskRun.executeTaskRun(TaskRun.java:135)\n app//com.starrocks.scheduler.TaskRunExecutor.lambda$executeTaskRun$0(TaskRunExecutor.java:36)\n app//com.starrocks.scheduler.TaskRunExecutor$$Lambda$1186/0x00000008409ae840.get(Unknown Source)\n java.base@11.0.21/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)\n java.base@11.0.21/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n java.base@11.0.21/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n java.base@11.0.21/java.lang.Thread.run(Thread.java:829)\n;”,“lockDbName”:“dsj_ods”,“lockWaiters”:[{“threadId”:1580470,“threadName”:“thrift-server-pool-1500252”},{“threadId”:1580460,“threadName”:“thrift-server-pool-1500242”},{“threadId”:1580442,“threadName”:“thrift-server-pool-1500224”},{“threadId”:1580440,“threadName”:“thrift-server-pool-1500222”},{“threadId”:1580400,“threadName”:“thrift-server-pool-1500182”},{“threadId”:1580433,“threadName”:“thrift-server-pool-1500215”},{“threadId”:1580432,“threadName”:“thrift-server-pool-1500214”},{“threadId”:1580431,“threadName”:“thrift-server-pool-1500213”},{“threadId”:1580430,“threadName”:“thrift-server-pool-1500212”},{“threadId”:1580402,“threadName”:“thrift-server-pool-1500184”},{“threadId”:1580420,“threadName”:“thrift-server-pool-1500202”},{“threadId”:1580399,“threadName”:“thrift-server-pool-1500181”},{“threadId”:1580419,“threadName”:“thrift-server-pool-1500201”},{“threadId”:1580418,“threadName”:“thrift-server-pool-1500200”},{“threadId”:1580397,“threadName”:“thrift-server-pool-1500179”},{“threadId”:1580417,“threadName”:“thrift-server-pool-1500199”},{“threadId”:1580416,“threadName”:“thrift-server-pool-1500198”},{“threadId”:1580415,“threadName”:“thrift-server-pool-1500197”},{“threadId”:1580385,“threadName”:“thrift-server-pool-1500167”},{“threadId”:1580401,“threadName”:“thrift-server-pool-1500183”},{“threadId”:1580406,“threadName”:“thrift-server-pool-1500188”},{“threadId”:1580405,“threadName”:“thrift-server-pool-1500187”},{“threadId”:1580398,“threadName”:“thrift-server-pool-1500180”},{“threadId”:1580396,“threadName”:“thrift-server-pool-1500178”},{“threadId”:1580395,“threadName”:“thrift-server-pool-1500177”},{“threadId”:1580392,“threadName”:“thrift-server-pool-1500174”},{“threadId”:1580391,“threadName”:“thrift-server-pool-1500173”},{“threadId”:1580376,“threadName”:“thrift-server-pool-1500158”},{“threadId”:1580390,“threadName”:“thrift-server-pool-1500172”},{“threadId”:1580389,“threadName”:“thrift-server-pool-1500171”},{“threadId”:1580388,“threadName”:“thrift-server-pool-1500170”},{“threadId”:1580387,“threadName”:“thrift-server-pool-1500169”},{“threadId”:1580386,“threadName”:“thrift-server-pool-1500168”},{“threadId”:1580383,“threadName”:“thrift-server-pool-1500165”},{“threadId”:1580375,“threadName”:“thrift-server-pool-1500157”},{“threadId”:1580371,“threadName”:“thrift-server-pool-1500153”},{“threadId”:1580370,“threadName”:“thrift-server-pool-1500152”},{“threadId”:1580369,“threadName”:“thrift-server-pool-1500151”},{“threadId”:1580361,“threadName”:“thrift-server-pool-1500143”},{“threadId”:1580359,“threadName”:“thrift-server-pool-1500141”},{“threadId”:1580357,“threadName”:“thrift-server-pool-1500139”},{“threadId”:1580356,“threadName”:“thrift-server-pool-1500138”},{“threadId”:1580355,“threadName”:“thrift-server-pool-1500137”},{“threadId”:1580354,“threadName”:“thrift-server-pool-1500136”},{“threadId”:1580339,“threadName”:“thrift-server-pool-1500121”},{“threadId”:1580331,“threadName”:“thrift-server-pool-1500113”},{“threadId”:1580325,“threadName”:“thrift-server-pool-1500107”},{“threadId”:1580324,“threadName”:“thrift-server-pool-1500106”},{“threadId”:1580323,“threadName”:“thrift-server-pool-1500105”},{“threadId”:1580321,“threadName”:“thrift-server-pool-1500103”},{“threadId”:1580320,“threadName”:“thrift-server-pool-1500102”},{“threadId”:1580319,“threadName”:“thrift-server-pool-1500101”},{“threadId”:1580318,“threadName”:“thrift-server-pool-1500100”},{“threadId”:1580317,“threadName”:“thrift-server-pool-1500099”},{“threadId”:1580316,“threadName”:“thrift-server-pool-1500098”},{“threadId”:1580315,“threadName”:“thrift-server-pool-1500097”},{“threadId”:1580314,“threadName”:“thrift-server-pool-1500096”},{“threadId”:1580313,“threadName”:“thrift-server-pool-1500095”},{“threadId”:1580312,“threadName”:“thrift-server-pool-1500094”},{“threadId”:1580311,“threadName”:“thrift-server-pool-1500093”},{“threadId”:1580299,“threadName”:“thrift-server-pool-1500081”},{“threadId”:1580284,“threadName”:“thrift-server-pool-1500066”},{“threadId”:1580283,“threadName”:“thrift-server-pool-1500065”},{“threadId”:1580264,“threadName”:“thrift-server-pool-1500047”},{“threadId”:1580270,“threadName”:“thrift-server-pool-1500053”},{“threadId”:1580280,“threadName”:“thrift-server-pool-1500063”},{“threadId”:1580279,“threadName”:“thrift-server-pool-1500062”},{“threadId”:1580262,“threadName”:“thrift-server-pool-1500045”},{“threadId”:1580269,“threadName”:“thrift-server-pool-1500052”},{“threadId”:1580263,“threadName”:“thrift-server-pool-1500046”},{“threadId”:1580261,“threadName”:“thrift-server-pool-1500044”},{“threadId”:1580260,“threadName”:“thrift-server-pool-1500043”},{“threadId”:1580259,“threadName”:“thrift-server-pool-1500042”},{“threadId”:1580258,“threadName”:“thrift-server-pool-1500041”},{“threadId”:1580257,“threadName”:“thrift-server-pool-1500040”},{“threadId”:1580245,“threadName”:“thrift-server-pool-1500028”},{“threadId”:1580236,“threadName”:“thrift-server-pool-1500019”},{“threadId”:1580235,“threadName”:“thrift-server-pool-1500018”},{“threadId”:1580223,“threadName”:“thrift-server-pool-1500006”},{“threadId”:1580226,“threadName”:“thrift-server-pool-1500009”},{“threadId”:1580165,“threadName”:“thrift-server-pool-1499948”},{“threadId”:1580147,“threadName”:“thrift-server-pool-1499930”},{“threadId”:1580164,“threadName”:“thrift-server-pool-1499947”},{“threadId”:1580163,“threadName”:“thrift-server-pool-1499946”},{“threadId”:1580158,“threadName”:“thrift-server-pool-1499941”},{“threadId”:1579913,“threadName”:“thrift-server-pool-1499742”},{“threadId”:1580148,“threadName”:“thrift-server-pool-1499931”},{“threadId”:1580143,“threadName”:“thrift-server-pool-1499926”},{“threadId”:1579834,“threadName”:“thrift-server-pool-1499672”},{“threadId”:1580121,“threadName”:“thrift-server-pool-1499904”},{“threadId”:1580144,“threadName”:“thrift-server-pool-1499927”},{“threadId”:1580129,“threadName”:“thrift-server-pool-1499912”},{“threadId”:1580118,“threadName”:“thrift-server-pool-1499901”},{“threadId”:1580037,“threadName”:“thrift-server-pool-1499820”},{“threadId”:1580120,“threadName”:“thrift-server-pool-1499903”},{“threadId”:1580090,“threadName”:“thrift-server-pool-1499873”},{“threadId”:1580036,“threadName”:“thrift-server-pool-1499819”},{“threadId”:1580039,“threadName”:“thrift-server-pool-1499822”},{“threadId”:32,“threadName”:“tablet scheduler”},{“threadId”:1579824,“threadName”:“starrocks-taskrun-pool-3222”},{“threadId”:1579918,“threadName”:“starrocks-taskrun-pool-3224”},{“threadId”:33,“threadName”:“tablet checker”},{“threadId”:1579752,“threadName”:“starrocks-taskrun-pool-3220”},{“threadId”:21,“threadName”:“PUBLISH_VERSION”}]}]

2024-03-14 09:26:04,292 INFO (thrift-server-pool-1500217|1580435) [FrontendServiceImpl.loadTxnRollback():1201] receive txn rollback request. db: dsj_ods, tbl: to_zt_saas_wk_process_info, txn_id: 122994078, reason: call frontend service failed, address=TNetworkAddress(hostname=10.21.54.64, port=9020), reason=THRIFT_EAGAIN (timed out), backend: 10.21.49.140

2024-03-14 09:26:04,298 INFO (thrift-server-pool-1500217|1580435) [DatabaseTransactionMgr.abortTransaction():1289] transaction:[TransactionState. txn_id: 122994078, label: 63e77d1c-3207-4540-9666-f1003f66144e, db id: 10233, table id list: 273834538, callback id: -1, coordinator: BE: 10.21.49.140, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1710379554081, commit time: -1, finish time: 1710379564292, total cost: 10211ms, reason: call frontend service failed, address=TNetworkAddress(hostname=10.21.54.64, port=9020), reason=THRIFT_EAGAIN (timed out) attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@54affa0] successfully rollback

2024-03-14 09:26:04,310 INFO (thrift-server-pool-1500254|1580472) [ReportHandler.putToQueue():288] update be 170609220 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,310 INFO (thrift-server-pool-1500255|1580473) [ReportHandler.putToQueue():288] update be 263705847 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,310 INFO (thrift-server-pool-1500256|1580474) [ReportHandler.putToQueue():288] update be 263701635 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,313 INFO (thrift-server-pool-1500257|1580475) [ReportHandler.putToQueue():288] update be 263703201 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,323 INFO (thrift-server-pool-1500258|1580476) [ReportHandler.putToQueue():288] update be 170610846 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,340 INFO (thrift-server-pool-1500259|1580477) [ReportHandler.putToQueue():288] update be 309199653 report task, type: RESOURCE_USAGE_REPORT

2024-03-14 09:26:04,401 INFO (thrift-server-pool-1500269|1580487) [FrontendServiceImpl.streamLoadPut():1262] receive stream load put request. db:dsj_ods, tbl: xxxx, txn_id: 122994062, load id: 1e419374-c4f8-0e16-f544-f4a1c2d949b2, backend: 10.21.51.180

2024-03-14 09:26:04,401 WARN (thrift-server-pool-1499822|1580039) [Database.logTryLockFailureEvent():150] try db lock failed. type: readLock, current

2024-03-14 09:26:04,401 WARN (thrift-server-pool-1499822|1580039) [FrontendServiceImpl.streamLoadPut():1273] failed to get stream load plan: get database read lock timeout, database=dsj_ods

Untitled-2.json (26.8 KB) 其中一次lockstate的打印。

@jingdan @trueeyu 相同遇到这个问题,FE master 节点的JVM thread 数被打爆了,导致所有请求一直被丢弃,timeout 。但是看这个pr是一个解决BE问题的。

什么版本,3.1?

2.5.13 版本

突然打爆

是不是那个时间点有很多union all 或是union的insert into select sql啊

如果是这个情况的话,把fe thread线程数调大,或是把pipeline_dop调小

你这个线程数已经接近临界值了,先改下 fe.conf thrift_server_max_worker_threads=8192; 调成吧,默认4096

是有很多union all ,报表看板的数据,但是白天晚上一直有。大多数都会出现在晚上没有用报表的时间FE thread 数打爆这是为啥呢

是个已知问题,原因比较复杂,可以加我细聊。当前规避的话,可以set global pipeline_dop=4 或更小来缓解。同时调大thrift_server_max_worker_threads

晚上的话,是不是inset into select,白天大部分是select?