Iceberg table join query makes all BE-node EC2 instances unavailable

【Details】Iceberg table join query makes all BE-node EC2 instances unavailable
【Background】
Early this morning we received a flood of alerts from the query cluster: BE nodes showed high disk read I/O, then became unreachable.

We opened a ticket with AWS to investigate. At first we suspected an underlying AWS issue, but the investigation showed the high disk read I/O was only a symptom, not the root cause of the EC2 instances going down: this is a shared-data (storage-compute separated) cluster with no local data, which could not by itself sustain such a high read OPS.

We then went through the queries that ran before the alerts fired. The following SQL was eventually identified:

WITH assigned_accounts AS (
		SELECT *
		FROM pt_mid.dwd_spark_daily_option_exercise_assignment_record_df
		WHERE yyyymmdd = 20250115
			AND cash_trade_date IS NULL
		LIMIT 1000
	), 
	parsed_data AS (
		SELECT yyyymmdd, account_no, account_type, security_subtype, SPLIT(symbol, ' ')[0] AS stock_symbol
			, business_type, quantity
		FROM assigned_accounts
	)
SELECT *
FROM parsed_data p
	LEFT JOIN ice.ice_dwd.dwd_assets_position_sodhis_di a
	ON p.account_no = a.broker_account_id
		AND p.stock_symbol = a.symbol
		AND p.yyyymmdd = a.yyyymmdd
LIMIT 1000;

We initially assumed this was an issue specific to StarRocks on ARM, so we started the (normally stopped) x86 cluster and ran the query there; the same problem occurred.

【Business impact】EC2 instances of all fixed and scaled-out BE nodes became unavailable
【Shared-data cluster】Yes
【StarRocks version】x86: 3.1.4, arm: 3.3.5
【Cluster size】
x86
3 × (FE + BE) co-located, 1 × standalone BE, 4 machines total (16c 128g)
arm
3 × FE (4c 16g), 4 × BE (16c 128g), 7 machines total
【Machine info】AWS EC2; x86: r5.4xlarge; arm: FE m7g.xlarge, BE r7g.4xlarge
【Contact】StarRocks shared-data group #2 - 飞速土豆
【Attachments】
x86 cluster monitoring screenshots; monitoring breaks off after 15:34 when all nodes went down
arm cluster monitoring screenshots; monitoring breaks off after 03:40 when all nodes went down, though one machine recovered automatically at 04:00

You should probably enable data cache to accelerate reads, and use SSD disks.

This shouldn't be related. Even after the SQL has returned its result (which is empty), memory and CPU keep climbing until the EC2 instance dies. There was no spill to disk during the query.

Can you now tell what data this disk I/O is mainly reading? In 3.1 the data cache is disabled by default. If it is reading cached data, first check the block-cache-related settings in cn.conf and the state of the enable_scan_block_cache session variable.

The high read OPS is mainly on the system disk and the data disk; the spill disk looks normal. Since this is a shared-data cluster, all data lives in S3, and there are no write OPS, so we are not sure what the read OPS are actually reading.

Also, the SQL takes only 1.89 sec to execute, yet after it completes the CPU does not drop and memory utilization keeps growing.




Below are the block-cache-related configurations.

3.1.4
fe:
enable_populate_block_cache=true
enable_scan_block_cache=false

be:
block_cache_block_buffer_enable=1
block_cache_block_size=262144
block_cache_checksum_enable=0
block_cache_direct_io_enable=0
block_cache_disk_path=/data/soft/starrocks/be/block_cache
block_cache_disk_size=21474836480
block_cache_enable=1
block_cache_engine=
block_cache_lru_insertion_point=1
block_cache_max_concurrent_inserts=1500000
block_cache_max_parcel_memory_mb=256
block_cache_mem_size=5368709120
block_cache_meta_path=/data/soft/starrocks/be/block_cache/
block_cache_report_stats=0

3.3.5
fe:

be:
block_cache_block_size=262144
block_cache_checksum_enable=false
block_cache_direct_io_enable=false
block_cache_disk_path=/data/soft/starrocks/be/block_cache
block_cache_disk_size=21474836480
block_cache_enable=true
block_cache_engine=
block_cache_max_concurrent_inserts=1500000
block_cache_mem_size=5368709120
block_cache_meta_path=/data/soft/starrocks/be/block_cache/

Below are the IOPS captured with iotop during testing:

In 3.1, enable_scan_block_cache=false means the block cache is not enabled, so it should be unrelated to that.

Do both versions hit the problem when executing this SQL? Does the query involve any materialized views, or does spill occur?

Yes, both x86 3.1.4 and arm 3.3.5 run into this problem when executing this SQL.

No materialized views are involved; it is just an ordinary Hive-to-Iceberg join.

As mentioned earlier, there were no high write OPS, and spill disk utilization did not change.

The disk IOPS on the x86 cluster before it went down doesn't look that high, does it? Check dmesg -T for any abnormal messages before the machine died.

Right, the latest tests show the disk IOPS is not necessarily high, and the machine does not necessarily die. The one thing consistent with before is that the SQL finishes in a few seconds while CPU and memory keep climbing.

dmesg -T shows nothing beyond the boot-time messages.

This crash probably has no necessary connection to disk I/O.
Check whether the kernel logs under /var/log/ have retained any information.
Also, if the post-query CPU growth now reproduces reliably, try set global enable_scan_datacache=false on 3.3 and retest.
If it still happens, run perf top on the CN node after the query finishes to see where the CPU is being spent.

Nothing abnormal was found under /var/log/.

Setting enable_scan_datacache=false reproduces the problem as well.

Here is a perf top screenshot taken while CPU was growing:

Is this perf top taken after the query finished? It still looks like it is reading and decompressing Parquet files. Run show processlist to check whether there are other requests, or raise the BE log level to see what it is doing.

Yes, this is perf top after the query finished; the SQL itself only takes a few seconds to run.

This test was done on the x86 cluster, which is normally kept stopped and whose workloads have all been migrated away.

Apart from the problematic SQL, no other queries were running when the screenshot was taken.


How long after the query ends do CPU and memory drop back down? With no other queries running, the read operations in perf should still belong to this query. Most likely it is because your SQL carries a LIMIT: once the upper layers have read enough rows they return, but the work already dispatched to the downstream BE operators still needs some time to finish.
Can you run explain and share the plan?

They drop back after 2-4 minutes.

Here is the explain with LIMIT:

mysql> explain WITH assigned_accounts AS ( SELECT * FROM pt_mid.dwd_spark_daily_option_exercise_assignment_record_df WHERE yyyymmdd = 20250115 and cash_trade_date is NULL LIMIT 1000 ), parsed_data AS ( SELECT yyyymmdd, account_no, account_type, security_subtype, SPLIT(symbol, ' ')[0] AS stock_symbol, business_type, quantity FROM assigned_accounts ) SELECT * FROM parsed_data p LEFT JOIN ice.ice_dwd.dwd_assets_position_sodhis_di a ON p.account_no = a.broker_account_id AND p.stock_symbol = a.symbol and p.yyyymmdd = a.yyyymmdd limit 1000\G
*************************** 1. row ***************************
Explain String: PLAN FRAGMENT 0
*************************** 2. row ***************************
Explain String:  OUTPUT EXPRS:82: yyyymmdd | 60: account_no | 61: account_type | 62: security_subtype | 83: expr | 64: business_type | 69: quantity | 84: user_id | 85: broker_id | 86: broker_account_id | 87: process_date | 88: firm | 89: correspondentid | 90: correspondent_office_id | 91: office_code | 92: registered_rep_code | 93: account_type | 94: symbol | 95: cusip | 96: trade_quantity | 97: settle_quantity | 98: currency_code | 99: security_type_code | 100: ticker_type | 101: descp | 102: margin_eligible_code | 103: closing_price | 104: last_activty_date | 105: loc_location | 106: loc_memo | 107: option_amount | 108: conversion_factor | 109: underlying_cusip | 110: option_symbol_root | 111: option_contract_date | 112: strike_price | 113: callput | 114: expiration_delivery_date | 115: position_market_value | 116: customer_source | 117: origin_data_region | 118: yyyymmdd
*************************** 3. row ***************************
Explain String:   PARTITION: UNPARTITIONED
*************************** 4. row ***************************
Explain String:
*************************** 5. row ***************************
Explain String:   RESULT SINK
*************************** 6. row ***************************
Explain String:
*************************** 7. row ***************************
Explain String:   6:EXCHANGE
*************************** 8. row ***************************
Explain String:      limit: 1000
*************************** 9. row ***************************
Explain String:
*************************** 10. row ***************************
Explain String: PLAN FRAGMENT 1
*************************** 11. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 12. row ***************************
Explain String:   PARTITION: HASH_PARTITIONED: 86: broker_account_id, 94: symbol, 118: yyyymmdd
*************************** 13. row ***************************
Explain String:
*************************** 14. row ***************************
Explain String:   STREAM DATA SINK
*************************** 15. row ***************************
Explain String:     EXCHANGE ID: 06
*************************** 16. row ***************************
Explain String:     UNPARTITIONED
*************************** 17. row ***************************
Explain String:
*************************** 18. row ***************************
Explain String:   5:HASH JOIN
*************************** 19. row ***************************
Explain String:   |  join op: RIGHT OUTER JOIN (PARTITIONED)
*************************** 20. row ***************************
Explain String:   |  colocate: false, reason:
*************************** 21. row ***************************
Explain String:   |  equal join conjunct: 86: broker_account_id = 60: account_no
*************************** 22. row ***************************
Explain String:   |  equal join conjunct: 94: symbol = 83: expr
*************************** 23. row ***************************
Explain String:   |  equal join conjunct: 118: yyyymmdd = 82: yyyymmdd
*************************** 24. row ***************************
Explain String:   |  limit: 1000
*************************** 25. row ***************************
Explain String:   |
*************************** 26. row ***************************
Explain String:   |----4:EXCHANGE
*************************** 27. row ***************************
Explain String:   |       limit: 1000
*************************** 28. row ***************************
Explain String:   |
*************************** 29. row ***************************
Explain String:   1:EXCHANGE
*************************** 30. row ***************************
Explain String:
*************************** 31. row ***************************
Explain String: PLAN FRAGMENT 2
*************************** 32. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 33. row ***************************
Explain String:   PARTITION: RANDOM
*************************** 34. row ***************************
Explain String:
*************************** 35. row ***************************
Explain String:   STREAM DATA SINK
*************************** 36. row ***************************
Explain String:     EXCHANGE ID: 04
*************************** 37. row ***************************
Explain String:     HASH_PARTITIONED: 60: account_no, 83: expr, 82: yyyymmdd
*************************** 38. row ***************************
Explain String:
*************************** 39. row ***************************
Explain String:   3:Project
*************************** 40. row ***************************
Explain String:   |  <slot 60> : 60: account_no
*************************** 41. row ***************************
Explain String:   |  <slot 61> : 61: account_type
*************************** 42. row ***************************
Explain String:   |  <slot 62> : 62: security_subtype
*************************** 43. row ***************************
Explain String:   |  <slot 64> : 64: business_type
*************************** 44. row ***************************
Explain String:   |  <slot 69> : 69: quantity
*************************** 45. row ***************************
Explain String:   |  <slot 82> : 82: yyyymmdd
*************************** 46. row ***************************
Explain String:   |  <slot 83> : split(59: symbol, ' ')[0]
*************************** 47. row ***************************
Explain String:   |  limit: 1000
*************************** 48. row ***************************
Explain String:   |
*************************** 49. row ***************************
Explain String:   2:HdfsScanNode
*************************** 50. row ***************************
Explain String:      TABLE: dwd_spark_daily_option_exercise_assignment_record_df
*************************** 51. row ***************************
Explain String:      PARTITION PREDICATES: 82: yyyymmdd = '20250115'
*************************** 52. row ***************************
Explain String:      NON-PARTITION PREDICATES: 80: cash_trade_date IS NULL
*************************** 53. row ***************************
Explain String:      partitions=1/1
*************************** 54. row ***************************
Explain String:      cardinality=2
*************************** 55. row ***************************
Explain String:      avgRowSize=9.0
*************************** 56. row ***************************
Explain String:      numNodes=0
*************************** 57. row ***************************
Explain String:      limit: 1000
*************************** 58. row ***************************
Explain String:
*************************** 59. row ***************************
Explain String: PLAN FRAGMENT 3
*************************** 60. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 61. row ***************************
Explain String:   PARTITION: RANDOM
*************************** 62. row ***************************
Explain String:
*************************** 63. row ***************************
Explain String:   STREAM DATA SINK
*************************** 64. row ***************************
Explain String:     EXCHANGE ID: 01
*************************** 65. row ***************************
Explain String:     HASH_PARTITIONED: 86: broker_account_id, 94: symbol, 118: yyyymmdd
*************************** 66. row ***************************
Explain String:
*************************** 67. row ***************************
Explain String:   0:IcebergScanNode
*************************** 68. row ***************************
Explain String:      TABLE: ice.ice_dwd.dwd_assets_position_sodhis_di
*************************** 69. row ***************************
Explain String:      cardinality=13170616485
*************************** 70. row ***************************
Explain String:      avgRowSize=35.0
*************************** 71. row ***************************
Explain String:      numNodes=0
71 rows in set (1.53 sec)

We also tested the query without LIMIT; the symptoms are identical:

WITH assigned_accounts AS (
		SELECT *
		FROM pt_mid.dwd_spark_daily_option_exercise_assignment_record_df
		WHERE yyyymmdd = 20250115
			AND cash_trade_date IS NULL
	), 
	parsed_data AS (
		SELECT yyyymmdd, account_no, account_type, security_subtype, SPLIT(symbol, ' ')[0] AS stock_symbol
			, business_type, quantity
		FROM assigned_accounts
	)
SELECT *
FROM parsed_data p
	LEFT JOIN ice.ice_dwd.dwd_assets_position_sodhis_di a
	ON p.account_no = a.broker_account_id
		AND p.stock_symbol = a.symbol
		AND p.yyyymmdd = a.yyyymmdd
;

SELECT *
FROM pt_mid.dwd_spark_daily_option_exercise_assignment_record_df
WHERE yyyymmdd = 20250115
	AND cash_trade_date IS NULL;


Here is the explain without LIMIT:

mysql> explain WITH assigned_accounts AS ( SELECT * FROM pt_mid.dwd_spark_daily_option_exercise_assignment_record_df WHERE yyyymmdd = 20250115 AND cash_trade_date IS NULL ),  parsed_data AS ( SELECT yyyymmdd, account_no, account_type, security_subtype, SPLIT(symbol, ' ')[0] AS stock_symbol , business_type, quantity FROM assigned_accounts ) SELECT * FROM parsed_data p LEFT JOIN ice.ice_dwd.dwd_assets_position_sodhis_di a ON p.account_no = a.broker_account_id AND p.stock_symbol = a.symbol AND p.yyyymmdd = a.yyyymmdd\G
*************************** 1. row ***************************
Explain String: PLAN FRAGMENT 0
*************************** 2. row ***************************
Explain String:  OUTPUT EXPRS:82: yyyymmdd | 60: account_no | 61: account_type | 62: security_subtype | 83: expr |64: business_type | 69: quantity | 84: user_id | 85: broker_id | 86: broker_account_id | 87: process_date | 88: firm | 89: correspondentid | 90: correspondent_office_id | 91: office_code | 92: registered_rep_code | 93: account_type | 94: symbol | 95: cusip | 96: trade_quantity | 97: settle_quantity | 98: currency_code | 99: security_type_code| 100: ticker_type | 101: descp | 102: margin_eligible_code | 103: closing_price | 104: last_activty_date | 105: loc_location | 106: loc_memo | 107: option_amount | 108: conversion_factor | 109: underlying_cusip | 110: option_symbol_root | 111: option_contract_date | 112: strike_price | 113: callput | 114: expiration_delivery_date | 115: position_market_value | 116: customer_source | 117: origin_data_region | 118: yyyymmdd
*************************** 3. row ***************************
Explain String:   PARTITION: UNPARTITIONED
*************************** 4. row ***************************
Explain String:
*************************** 5. row ***************************
Explain String:   RESULT SINK
*************************** 6. row ***************************
Explain String:
*************************** 7. row ***************************
Explain String:   6:EXCHANGE
*************************** 8. row ***************************
Explain String:
*************************** 9. row ***************************
Explain String: PLAN FRAGMENT 1
*************************** 10. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 11. row ***************************
Explain String:   PARTITION: HASH_PARTITIONED: 86: broker_account_id, 94: symbol, 118: yyyymmdd
*************************** 12. row ***************************
Explain String:
*************************** 13. row ***************************
Explain String:   STREAM DATA SINK
*************************** 14. row ***************************
Explain String:     EXCHANGE ID: 06
*************************** 15. row ***************************
Explain String:     UNPARTITIONED
*************************** 16. row ***************************
Explain String:
*************************** 17. row ***************************
Explain String:   5:HASH JOIN
*************************** 18. row ***************************
Explain String:   |  join op: RIGHT OUTER JOIN (PARTITIONED)
*************************** 19. row ***************************
Explain String:   |  colocate: false, reason:
*************************** 20. row ***************************
Explain String:   |  equal join conjunct: 86: broker_account_id = 60: account_no
*************************** 21. row ***************************
Explain String:   |  equal join conjunct: 94: symbol = 83: expr
*************************** 22. row ***************************
Explain String:   |  equal join conjunct: 118: yyyymmdd = 82: yyyymmdd
*************************** 23. row ***************************
Explain String:   |
*************************** 24. row ***************************
Explain String:   |----4:EXCHANGE
*************************** 25. row ***************************
Explain String:   |
*************************** 26. row ***************************
Explain String:   1:EXCHANGE
*************************** 27. row ***************************
Explain String:
*************************** 28. row ***************************
Explain String: PLAN FRAGMENT 2
*************************** 29. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 30. row ***************************
Explain String:   PARTITION: RANDOM
*************************** 31. row ***************************
Explain String:
*************************** 32. row ***************************
Explain String:   STREAM DATA SINK
*************************** 33. row ***************************
Explain String:     EXCHANGE ID: 04
*************************** 34. row ***************************
Explain String:     HASH_PARTITIONED: 60: account_no, 83: expr, 82: yyyymmdd
*************************** 35. row ***************************
Explain String:
*************************** 36. row ***************************
Explain String:   3:Project
*************************** 37. row ***************************
Explain String:   |  <slot 60> : 60: account_no
*************************** 38. row ***************************
Explain String:   |  <slot 61> : 61: account_type
*************************** 39. row ***************************
Explain String:   |  <slot 62> : 62: security_subtype
*************************** 40. row ***************************
Explain String:   |  <slot 64> : 64: business_type
*************************** 41. row ***************************
Explain String:   |  <slot 69> : 69: quantity
*************************** 42. row ***************************
Explain String:   |  <slot 82> : 82: yyyymmdd
*************************** 43. row ***************************
Explain String:   |  <slot 83> : split(59: symbol, ' ')[0]
*************************** 44. row ***************************
Explain String:   |
*************************** 45. row ***************************
Explain String:   2:HdfsScanNode
*************************** 46. row ***************************
Explain String:      TABLE: dwd_spark_daily_option_exercise_assignment_record_df
*************************** 47. row ***************************
Explain String:      PARTITION PREDICATES: 82: yyyymmdd = '20250115'
*************************** 48. row ***************************
Explain String:      NON-PARTITION PREDICATES: 80: cash_trade_date IS NULL
*************************** 49. row ***************************
Explain String:      partitions=1/1
*************************** 50. row ***************************
Explain String:      cardinality=2
*************************** 51. row ***************************
Explain String:      avgRowSize=9.0
*************************** 52. row ***************************
Explain String:      numNodes=0
*************************** 53. row ***************************
Explain String:
*************************** 54. row ***************************
Explain String: PLAN FRAGMENT 3
*************************** 55. row ***************************
Explain String:  OUTPUT EXPRS:
*************************** 56. row ***************************
Explain String:   PARTITION: RANDOM
*************************** 57. row ***************************
Explain String:
*************************** 58. row ***************************
Explain String:   STREAM DATA SINK
*************************** 59. row ***************************
Explain String:     EXCHANGE ID: 01
*************************** 60. row ***************************
Explain String:     HASH_PARTITIONED: 86: broker_account_id, 94: symbol, 118: yyyymmdd
*************************** 61. row ***************************
Explain String:
*************************** 62. row ***************************
Explain String:   0:IcebergScanNode
*************************** 63. row ***************************
Explain String:      TABLE: ice.ice_dwd.dwd_assets_position_sodhis_di
*************************** 64. row ***************************
Explain String:      cardinality=13170616485
*************************** 65. row ***************************
Explain String:      avgRowSize=35.0
*************************** 66. row ***************************
Explain String:      numNodes=0
66 rows in set (1.48 sec)

A couple of minutes after it ends, be.INFO should print query_id-related information, something like "finished query_id ...", which includes the query's execution time, memory consumption, and scan bytes. Look up that query_id in the FE audit log to confirm it is the query you finished earlier. If it is the same query, consider turning on the BE vlog to see what it is actually doing during that window.
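The lookup described here can be sketched as a tiny script. Note the log line format below is a stand-in invented for illustration (the exact "finished query_id" wording and fields vary by StarRocks version); in practice you would read the real be.INFO on each BE/CN node and fe.audit.log on the FE.

```python
def find_query_lines(log_text, query_id):
    """Return every log line that mentions the given query_id."""
    return [line for line in log_text.splitlines() if query_id in line]

# Stand-in sample of a be.INFO completion record (format is illustrative only).
sample_be_info = (
    "I0124 17:32:41 fragment_mgr.cpp:123] finished query_id="
    "25dcef5c-da36-11ef-8d08-06ae1418ac3a mem_usage=53775824 scan_bytes=17022"
)

for line in find_query_lines(sample_be_info, "25dcef5c-da36-11ef-8d08-06ae1418ac3a"):
    print(line)
```

Matching the same query_id in both files confirms whether the lingering CPU belongs to the already-finished query.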

This is the audit log of a query run on the test ARM cluster. Because the BE nodes went down, we could not find any query_id-related logs on the BE nodes.

{
	"Timestamp": "1737711157010",
	"Client": "10.60.22.0:44034",
	"User": "root",
	"AuthorizedUser": "'root'@'%'",
	"ResourceGroup": "default_wg",
	"Catalog": "unified_catalog",
	"Db": "",
	"State": "EOF",
	"ErrorCode": "",
	"Time": "3984",
	"ScanBytes": "17022",
	"ScanRows": "0",
	"ReturnRows": "0",
	"CpuCostNs": "11916257",
	"MemCostBytes": "53775824",
	"StmtId": "37",
	"QueryId": "25dcef5c-da36-11ef-8d08-06ae1418ac3a",
	"IsQuery": "true",
	"feIp": "10.60.22.0",
	"Stmt": "with tmp as( select * from us_dim.dim_user_register_info where user_id ='dddd' ), tmp1 as(select * from tmp) select * from tmp1 t1 left join ice.iceberg_db.iceberg_lineitem_parquet_zstd_multi_v2 t2 on t1.uuid=t2.id and t1.user_id=t2.l_shipdate limit 10",
	"Digest": "",
	"PlanCpuCost": "1.6531322052E12",
	"PlanMemCost": "896.0",
	"PendingTimeMs": "19",
	"BigQueryLogCPUSecondThreshold": "480",
	"BigQueryLogScanBytesThreshold": "10737418240",
	"BigQueryLogScanRowsThreshold": "1000000000",
	"Warehouse": "default_warehouse",
	"CandidateMVs": null,
	"HitMvs": null,
	"IsForwardToLeader": "false"
}
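As a quick sanity check, the audit entry above can be decoded into human-readable units (a minimal sketch; the field names are copied verbatim from the JSON shown, and the conversions assume Time is in milliseconds, CpuCostNs in nanoseconds, and MemCostBytes in bytes):

```python
import json

# Subset of the audit entry shown above.
audit_entry = json.loads("""{
    "Time": "3984",
    "ScanBytes": "17022",
    "ScanRows": "0",
    "ReturnRows": "0",
    "CpuCostNs": "11916257",
    "MemCostBytes": "53775824",
    "QueryId": "25dcef5c-da36-11ef-8d08-06ae1418ac3a"
}""")

wall_s = int(audit_entry["Time"]) / 1000             # ~3.98 s wall time
cpu_s = int(audit_entry["CpuCostNs"]) / 1e9          # ~0.012 s of CPU accounted
mem_mib = int(audit_entry["MemCostBytes"]) / 2**20   # ~51.3 MiB peak memory

print(f"wall: {wall_s:.3f} s, cpu: {cpu_s:.3f} s, mem: {mem_mib:.1f} MiB, "
      f"scanned: {audit_entry['ScanBytes']} bytes / {audit_entry['ScanRows']} rows")
```

The audited footprint is tiny (milliseconds of CPU, ~51 MiB, 17 KB scanned), which lines up with the thread's observation: whatever keeps burning CPU and memory after the query returns is not being attributed to this query in the audit log.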

Check FE's log/fe.internal.log to see whether executing this query triggers any other internal queries. If not, enable vlog and take a look; failing that, dump a core file after the query finishes and debug which query the activity belongs to.