【查询出现毛刺】正常查询发现不定时出现耗时较长毛刺现象

日志中看不出来什么信息:

BE节点3:
I0909 14:08:24.456902 25310 fragment_executor.cpp:162] Prepare(): query_id=ec2ce640-6e71-11ef-999c-5254002c4bf9 fragment_instance_id=ec2ce640-6e71-11ef-999c-5254002c4bfd backend_num=0
I0909 14:08:24.458081 25393 fragment_executor.cpp:162] Prepare(): query_id=ec2ce640-6e71-11ef-999c-5254002c4bf9 fragment_instance_id=ec2ce640-6e71-11ef-999c-5254002c4bfc backend_num=3

BE节点2:
I0909 14:08:24.455874 56274 fragment_executor.cpp:162] Prepare(): query_id=ec2ce640-6e71-11ef-999c-5254002c4bf9 fragment_instance_id=ec2ce640-6e71-11ef-999c-5254002c4bfa backend_num=1

BE节点1:
I0909 14:08:24.454375 269174 fragment_executor.cpp:162] Prepare(): query_id=ec2ce640-6e71-11ef-999c-5254002c4bf9 fragment_instance_id=ec2ce640-6e71-11ef-999c-5254002c4bfb backend_num=2


FE日志:
2024-09-09 14:08:24,452 INFO (starrocks-mysql-nio-pool-33491|392082) [MetadataMgr.removeQueryMetadata():113] Succeed to deregister query level connector metadata on query id: ec2ce640-6e71-11ef-999c-5254002c4bf9
2024-09-09 14:08:24,452 INFO (starrocks-mysql-nio-pool-33491|392082) [QeProcessorImpl.registerQuery():83] register query id = ec2ce640-6e71-11ef-999c-5254002c4bf9, job: -1
2024-09-09 14:08:29,490 INFO (starrocks-mysql-nio-pool-33491|392082) [QeProcessorImpl.unregisterQuery():93] deregister query id ec2ce640-6e71-11ef-999c-5254002c4bf9
2024-09-09 14:08:29,490 INFO (starrocks-mysql-nio-pool-33491|392082) [MetadataMgr.removeQueryMetadata():113] Succeed to deregister query level connector metadata on query id: ec2ce640-6e71-11ef-999c-5254002c4bf9
2024-09-09 14:08:29,534 INFO (nioEventLoopGroup-6-58|221) [HttpServerHandler.channelRead():90] receive http request. url: /query_profile?query_id=ec2ce640-6e71-11ef-999c-5254002c4bf9, thread id: 221, startTime: 1725862109533, latency: 1 ms

fe.audit.log审计日志:
2024-09-09 14:08:29,490 [slow_query] |Client=xxxx:54270|User=root|AuthorizedUser='root'@'%'|ResourceGroup=default_wg|Catalog=default_catalog|Db=|State=EOF|ErrorCode=|Time=5040|ScanBytes=0|ScanRows=0|ReturnRows=1|CpuCostNs=4900406|MemCostBytes=4625294|StmtId=11805177|QueryId=ec2ce640-6e71-11ef-999c-5254002c4bf9|IsQuery=true|feIp=txhs-ls3-doris01.data|Stmt=select /*+ SET_VAR(pipeline_dop=4) */ sum(cash) from mobile_game.`t_app_pub_user_payment_ri_bak_20240909` where gameid = 291 and recharge_user = '9_109554117' and logtime >= '2024-04-12 21:30:45' and logtime <= '2024-04-26 04:00:00' /* dop_query */;|Digest=76141bb24cb70d3bf0ca115daa79888e|PlanCpuCost=63.467336209286785|PlanMemCost=2.0|IsForwardToLeader=false
  • DriverPrepareTime: 1s254ms

         - __MAX_OF_DriverPrepareTime: 5s5ms
    
         - __MIN_OF_DriverPrepareTime: 1.492ms

应该是主键表apply wait version慢的问题,3.1优化了

但是我这个查询的表是唯一模型,不是主键模型,执行计划那个也没日志看出来,这个也是偶然,p99都没问题,就是偶尔出现

EXCHANGE_SINK (plan_node_id=3):

      CommonMetrics:

         - CloseTime: 3.796us

           - __MAX_OF_CloseTime: 15.890us

           - __MIN_OF_CloseTime: 241ns

         - OperatorTotalTime: 18.607us

           - __MAX_OF_OperatorTotalTime: 50.787us

           - __MIN_OF_OperatorTotalTime: 1.763us

         - PeakMemoryUsage: 0.00 

         - PrepareTime: 1s254ms

           - __MAX_OF_PrepareTime: 5s5ms

           - __MIN_OF_PrepareTime: 1.458ms

看看是不是dns解析有问题

查看dns解析不存在问题,可以正常找到地址,且耗时也不高

大概率还是dns解析的问题,可以抓个卡住的pstack, 或者换成直连ip试下

上边的那个耗时长的profile就是通过IP直连一个节点的FE地址获取的结果,等知道延迟已经没执行线程了,抓取不到啥信息

你这个不是明显 DNS访问的吗
BackendAddresses: txhs-ls3-doris01.data:9060,txhs-ls3-doris03.data:9060,txhs-ls3-doris02.data:9060

这是集群配置了使用FQDN,所以集群内部是域名访问吧,我测试用的JDBC连接是FE的IP端口,你说的换成直连是要我调整集群为IP地址?这个不太能调整,线上集群服务,不能这么随便修改,有日志能定位你说的这种访问慢问题吗

先看下你们的域名解析监控吧