[3.3.6]查询Hive, BE节点HDFS_SCAN性能远慢于Trino

starrrrrrrrrr · 2024年11月29日 06:30

【详述】
把Trino查询迁移到Starrocks时, 发现相同资源情况下，同一个SQL在trino中耗时2s+, 在Starrocks中耗时16s+.
SQL是简单的select查询，通过profile分析发现查询耗时集中在BE节点HDFS_SCAN阶段。读取了两个scan_range, 平均大小70MB, 一个读取的很快，一个读取的很慢。
从trino的查询分析来看，ufs没有瓶颈，怀疑是BE实现的问题，能帮忙一起分析下原因么

SQL

SELECT 'xxx_episode_id' AS "aid"
	, if('episode_en_name' = 'tv_name', 'tv_name', concat('tv_name', '</br>', 'episode_en_name')) AS "tv_name"
	, 'channel_id', 'channel_name', 'online_time', 'tag_place'
FROM table LIMIT 10;

看到OpenFile耗时久

| └──HDFS_SCAN (id=0) |
| Estimates: [row: 10, cpu: ?, memory: ?, network: ?, cost: 4320.0] |
| TotalTime: 17s39ms (100.00%) [CPUTime: 228.231ms, ScanTime: 16s811ms] |
| OutputRows: 20 |
| Detail Timers: [ScanTime = IOTaskExecTime + IOTaskWaitTime] |
| IOTaskExecTime: 8s565ms [min=319.212ms, max=16s811ms] |
| OpenFile: 8s267ms [min=165.493ms, max=16s369ms] |
| ReaderInit: 8s267ms [min=165.447ms, max=16s369ms] |
| IOTaskWaitTime: 84.239us [min=65.571us, max=102.908us]

【是否存算分离】存算分离
【StarRocks版本】3.3.6
【集群规模】1fe+6be
f1992e8d-ad88-11ef-97d1-fa163e451b63profile.txt (28.1 KB)

Trino读取的数据量总共34.68MB，Starrocks读取数据量135.612 MB。这也是性能差距的原因之一，但不因该有这么大差距
Trino Plan

Fragment 2 [SOURCE]
CPU: 2.02s, Scheduled: 3.29s, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 443493 rows (0B); per task: avg.: 147831.00 std.dev.: 209064.61, O
Output layout: []
Output partitioning: SINGLE []
LimitPartial[count = 10]
│ Layout: []
│ Estimates: {rows: 10 (0B), cpu: 0, memory: 0B, network: 0B}
│ CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Blocked: 0.00ns (0.00%), Output: 10 rows (0B)
│ Input avg.: 147831.00 rows, Input std.dev.: 141.42%
└─ TableScan[table = xxxxxx]
Layout: []
Estimates: {rows: ? (0B), cpu: 0, memory: 0B, network: 0B}
CPU: 629.00ms (100.00%), Scheduled: 1.09s (100.00%), Blocked: 0.00ns (0.00%), Output: 443493 rows (0B)
Input avg.: 147831.00 rows, Input std.dev.: 141.42%
Input: 443493 rows (0B), Physical input: 34.68MB, Physical input time: 230.01ms

daxia · 2024年12月13日 07:53

大佬，有进展了吗。
我们这边又看到ReaderInit时间较长

zombee0 · 2024年12月17日 03:26

你好，请问你是使用的哪种存储格式？

daxia · 2024年12月17日 03:38

我们使用的是ORC

zombee0 · 2024年12月17日 06:05

你们配置上有什么更改吗？cache? coalesce? 等等
看起来也不像是卡在io上，readerinit耗时有些长，table schema可以发下吗？

daxia · 2024年12月17日 06:29

没有修改。使用拓扑看到就是HDFS SCAN时间较长。
schema我也没有，不过我们是访问hive外表的。
CONNECTOR_SCAN (plan_node_id=0):
CommonMetrics:
- JoinRuntimeFilterEvaluate: 0
- JoinRuntimeFilterHashTime: 0ns
- JoinRuntimeFilterInputRows: 0
- JoinRuntimeFilterOutputRows: 0
- JoinRuntimeFilterTime: 0ns
- OperatorTotalTime: 21.517ms
- __MAX_OF_OperatorTotalTime: 41.583ms
- __MIN_OF_OperatorTotalTime: 8.445ms
- PullChunkNum: 6
- __MAX_OF_PullChunkNum: 1
- __MIN_OF_PullChunkNum: 0
- PullRowNum: 6
- __MAX_OF_PullRowNum: 1
- __MIN_OF_PullRowNum: 0
- PullTotalTime: 21.440ms
- __MAX_OF_PullTotalTime: 41.509ms
- __MIN_OF_PullTotalTime: 8.341ms
- PushChunkNum: 0
- PushRowNum: 0
- PushTotalTime: 0ns
- RuntimeBloomFilterNum: 0
- RuntimeInFilterNum: 0
UniqueMetrics:
- AdaptiveIOTasks: False
- DataSourceType: HiveDataSource
- MorselQueueType: dynamic_morsel_queue
- Predicates: get_json_object(27: properties, ‘$.numList’) LIKE ‘%18649060862%’
- PredicatesPartition
- SharedScan: False
- Table: event_inc_orc
- ORCSearchArgument: : expr = YES_NO_NULL
- ChunkBufferCapacity: 2.048K (2048)
- DefaultChunkBufferCapacity: 2.048K (2048)
- IOTaskExecTime: 14m34s
- __MAX_OF_IOTaskExecTime: 26m33s
- __MIN_OF_IOTaskExecTime: 6m18s
- ColumnConvertTime: 1s417ms
- __MAX_OF_ColumnConvertTime: 7s13ms
- __MIN_OF_ColumnConvertTime: 403.574ms
- ColumnReadTime: 12m25s
- __MAX_OF_ColumnReadTime: 22m46s
- __MIN_OF_ColumnReadTime: 4m45s
- DynamicPruneScanRange:
- JoinRuntimeFilterEvaluate: 0
- JoinRuntimeFilterHashTime: 0ns
- JoinRuntimeFilterInputScanRanges: 0
- JoinRuntimeFilterOutputScanRanges: 0
- JoinRuntimeFilterTime: 0ns
- ExprFilterTime: 29s166ms
- __MAX_OF_ExprFilterTime: 55s264ms
- __MIN_OF_ExprFilterTime: 14s418ms
- InputStream:
- AppIOBytesRead: 217.109 GB
- __MAX_OF_AppIOBytesRead: 2.032 GB
- __MIN_OF_AppIOBytesRead: 533.785 MB
- AppIOCounter: 154.908K (154908)
- __MAX_OF_AppIOCounter: 1.434K (1434)
- __MIN_OF_AppIOCounter: 383
- AppIOTime: 13m53s
- __MAX_OF_AppIOTime: 25m38s
- __MIN_OF_AppIOTime: 5m55s
- FSIOBytesRead: 217.236 GB
- __MAX_OF_FSIOBytesRead: 2.035 GB
- __MIN_OF_FSIOBytesRead: 533.971 MB
- FSIOCounter: 58.428K (58428)
- __MAX_OF_FSIOCounter: 537
- __MIN_OF_FSIOCounter: 144
- FSIOTime: 13m53s
- __MAX_OF_FSIOTime: 25m38s
- __MIN_OF_FSIOTime: 5m55s
- LateMaterializeSkipRows: 1.403B (1402629770)
- __MAX_OF_LateMaterializeSkipRows: 13.108M (13108478)
- __MIN_OF_LateMaterializeSkipRows: 3.354M (3353519)
- MemAllocFailed: 0
- ORC:
- IcebergV2FormatTimer:
- DeleteFileBuildFilterTime: 13.239ms
- __MAX_OF_DeleteFileBuildFilterTime: 60.918ms
- __MIN_OF_DeleteFileBuildFilterTime: 4.454ms
- DeleteFileBuildTime: 0ns
- DeleteFilesPerScan: 0
- StripeActiveLazyColumnIOCoalesceSeperately: 10.737K (10737)
- __MAX_OF_StripeActiveLazyColumnIOCoalesceSeperately: 103
- __MIN_OF_StripeActiveLazyColumnIOCoalesceSeperately: 25
- StripeActiveLazyColumnIOCoalesceTogether: 1.615K (1615)
- __MAX_OF_StripeActiveLazyColumnIOCoalesceTogether: 25
- __MIN_OF_StripeActiveLazyColumnIOCoalesceTogether: 0
- TotalStripeNumber: 12.352K (12352)
- __MAX_OF_TotalStripeNumber: 113
- __MIN_OF_TotalStripeNumber: 31
- TotalStripeSize: 348.773 GB
- __MAX_OF_TotalStripeSize: 3.262 GB
- __MIN_OF_TotalStripeSize: 856.192 MB
- TotalTinyStripeSize: 0.000 B
- OpenFile: 1m38s
- __MAX_OF_OpenFile: 6m51s
- __MIN_OF_OpenFile: 16s875ms
- RawRowsRead: 1.403B (1402654346)
- __MAX_OF_RawRowsRead: 13.113M (13112574)
- __MIN_OF_RawRowsRead: 3.354M (3353519)
- ReaderInit: 1m38s
- __MAX_OF_ReaderInit: 6m51s
- __MIN_OF_ReaderInit: 16s874ms
- RowsRead: 6
- __MAX_OF_RowsRead: 1
- __MIN_OF_RowsRead: 0
- ScanRanges: 7.049K (7049)
- __MAX_OF_ScanRanges: 66
- __MIN_OF_ScanRanges: 17
- ScanRangesSize: 365.719 GB
- __MAX_OF_ScanRangesSize: 3.446 GB
- __MIN_OF_ScanRangesSize: 908.635 MB
- SharedBuffered:
- DirectIOBytes: 213.283 MB
- __MAX_OF_DirectIOBytes: 2.297 MB
- __MIN_OF_DirectIOBytes: 525.999 KB
- DirectIOCount: 21.939K (21939)
- __MAX_OF_DirectIOCount: 219
- __MIN_OF_DirectIOCount: 54
- DirectIOTime: 3m56s
- __MAX_OF_DirectIOTime: 9m57s
- __MIN_OF_DirectIOTime: 1m11s
- SharedAlignIOBytes: 0.000 B
- SharedIOBytes: 217.027 GB
- __MAX_OF_SharedIOBytes: 2.033 GB
- __MIN_OF_SharedIOBytes: 533.458 MB
- SharedIOCount: 36.489K (36489)
- __MAX_OF_SharedIOCount: 338
- __MIN_OF_SharedIOCount: 90
- SharedIOTime: 9m57s
- __MAX_OF_SharedIOTime: 18m26s
- __MIN_OF_SharedIOTime: 3m15s
- IOTaskWaitTime: 33.986ms
- __MAX_OF_IOTaskWaitTime: 166.453ms
- __MIN_OF_IOTaskWaitTime: 11.028ms
- MorselsCount: 9.178K (9178)
- __MAX_OF_MorselsCount: 107
- __MIN_OF_MorselsCount: 22
- PeakChunkBufferMemoryUsage: 0.000 B
- PeakChunkBufferSize: 1.194K (1194)
- PeakIOTasks: 37
- __MAX_OF_PeakIOTasks: 64
- __MIN_OF_PeakIOTasks: 19
- PeakScanTaskQueueSize: 9.730K (9730)
- __MAX_OF_PeakScanTaskQueueSize: 97
- __MIN_OF_PeakScanTaskQueueSize: 4
- PrepareChunkSourceTime: 2.445ms
- __MAX_OF_PrepareChunkSourceTime: 18.777ms
- __MIN_OF_PrepareChunkSourceTime: 269.902us
- ScanTime: 4h6m
- __MAX_OF_ScanTime: 12h11m
- __MIN_OF_ScanTime: 53m14s
- SubmitTaskCount: 9.184K (9184)
- __MAX_OF_SubmitTaskCount: 107
- __MIN_OF_SubmitTaskCount: 22
- SubmitTaskTime: 16.350ms
- __MAX_OF_SubmitTaskTime: 32.164ms
- __MIN_OF_SubmitTaskTime: 6.849ms
- TabletCount: 2.538K (2538)
- __MAX_OF_TabletCount: 462

zombee0 · 2024年12月17日 11:11

你新发的这个profile看起来跟你之前发的profile是不太一样的，这个更多的是卡在IO上，我不知道你这个sql是啥，这个有trino的profile吗？

starrrrrrrrrr · 2024年12月18日 14:04

@zombee0 这个Hive表的schema如下：

 CREATE TABLE table (  
    qipu_episode_id bigint,                                        
    tv_name varchar,                                               
    episode_en_name varchar,                                       
    qipu_album_id bigint,                                          
    album_name varchar,                                            
    album_en_name varchar,                                         
    channel_id integer,                                            
    channel_name varchar,                                          
    online_time varchar,                                           
    tag_place varchar,                                             
    tag_place_new varchar,                                         
    channel_en_name varchar,                                       
    tag_en_place varchar,                                          
    content_type varchar   ,        
    upload_video_type varchar   ,         
    play_mode varchar ,     
    duration bigint ,                  
    episode_type_v2 varchar,
    is_international boolean   
 )                                                                 
 WITH (                                                            
    format = 'SEQUENCEFILE'                                        
 )

starrrrrrrrrr · 2024年12月18日 13:55

@daxia 根据你的profile来看，读取的是iceberg v2表
问题1：从TotalStripeSize MIN/MAX来看差距有4倍，检查下文件大小是否有显著的倾斜情况
问题2：从TotalStripeNumber看有1w+ , 检查下是否delete file过多没有合并
问题3：从ScanTime来看耗时过久，怀疑存储侧是否有性能问题。

daxia · 2025年01月13日 12:50

嗯嗯，我们调整了访问存储桶的参数，现在好很多了。瓶颈应该不在SR这边