[Crash] All BE nodes crash when running a query: starrocks::vectorized::ColumnHelper::create_column()

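Version: 2.5.7
Running one SQL statement on the leader node crashes every BE node in the cluster; running it on a follower node fails with a "build chunk meta error" error.
The cluster crashes when the hic_t_cc_rec table in the SQL is a partitioned table; querying the same table without partitioning works fine. The tables and the query are in the script below. Running the script against empty tables did not reproduce the problem, but it can serve as a reference for the table schema: crash.sql (25.9 KB)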

be(1).out (25.3 KB)

be.warning log:
beWARNING20231017.tar.gz (36.7 MB)

Can you run show backends; show frontends; and confirm that the versions match?
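
For anyone repeating this check, here is a minimal Python sketch of the comparison. It assumes the rows from show backends; and show frontends; have already been fetched as dictionaries and that each row carries a Version column. The function names and the sample version strings are hypothetical, made up for illustration only.

```python
from typing import Dict, Iterable, Set


def distinct_versions(rows: Iterable[Dict[str, str]], column: str = "Version") -> Set[str]:
    """Collect the distinct values of the version column across node rows."""
    return {row[column].strip() for row in rows}


def versions_consistent(backends: Iterable[Dict[str, str]],
                        frontends: Iterable[Dict[str, str]]) -> bool:
    """True when all BE rows share one version and all FE rows share one version."""
    return len(distinct_versions(backends)) <= 1 and len(distinct_versions(frontends)) <= 1


# Hypothetical rows; real ones would come from the two SHOW statements.
be_rows = [{"Version": "2.5.7-aaaaaaa"}, {"Version": "2.5.7-aaaaaaa"}]
fe_rows = [{"Version": "2.5.7-aaaaaaa"}, {"Version": "custom-build"}]
print(versions_consistent(be_rows, be_rows))
print(versions_consistent(be_rows, fe_rows))
```

The sketch only checks that the BEs agree with each other and the FEs agree with each other, since a self-built FE package may legitimately report a different version string from the BEs.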

Please also post the explain costs output for the SQL that crashes the BEs.


image
They match. This FE has two PRs merged in and I built the package myself; I don't know why the version number ended up like this.

PLAN FRAGMENT 0
OUTPUT EXPRS:425: row_id | 435: wo_id | 437: product_id | 455: fault_desc | 433: telephone2 | 637: date | 515: callchannel | 591: session_id | 608: create_time | 590: user_id | 589: account_id | 620: ter_user_phone | 636: row_number()
PARTITION: UNPARTITIONED

RESULT SINK

30:EXCHANGE
limit: 200

PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 425: row_id

STREAM DATA SINK
EXCHANGE ID: 30
UNPARTITIONED

29:Project
| <slot 425> : 425: row_id
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 589> : 589: account_id
| <slot 590> : 590: user_id
| <slot 591> : 591: session_id
| <slot 608> : 608: create_time
| <slot 620> : 620: ter_user_phone
| <slot 636> : 636: row_number()
| <slot 637> : date(426: enter_time)
| limit: 200
|
28:SELECT
| predicates: 636: row_number() = 1
| limit: 200
|
27:ANALYTIC
| functions: [, row_number(), ]
| partition by: 425: row_id
| order by: 608: create_time ASC
| window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|
26:SORT
| order by: <slot 425> 425: row_id ASC, <slot 608> 608: create_time ASC
| offset: 0
|
25:EXCHANGE

PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 433: telephone2

STREAM DATA SINK
EXCHANGE ID: 25
HASH_PARTITIONED: 425: row_id

24:PARTITION-TOP-N
| partition by: 425: row_id
| partition limit: 1
| order by: <slot 425> 425: row_id ASC, <slot 608> 608: create_time ASC
| offset: 0
|
23:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 433: telephone2 = 620: ter_user_phone
| other join predicates: 608: create_time > 426: enter_time, days_sub(608: create_time, 7) <= 426: enter_time
|
|----22:EXCHANGE
|
15:EXCHANGE

PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 590: user_id

STREAM DATA SINK
EXCHANGE ID: 22
HASH_PARTITIONED: 620: ter_user_phone

21:Project
| <slot 589> : 589: account_id
| <slot 590> : 590: user_id
| <slot 591> : 591: session_id
| <slot 608> : 608: create_time
| <slot 620> : 620: ter_user_phone
|
20:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 590: user_id = 622: ter_user_id
| equal join conjunct: 589: account_id = 623: terminal_id
|
|----19:EXCHANGE
|
17:EXCHANGE

PLAN FRAGMENT 4
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 19
HASH_PARTITIONED: 622: ter_user_id

18:OlapScanNode
TABLE: lemc_terminal_user_info
PREAGGREGATION: ON
PREDICATES: 620: ter_user_phone IS NOT NULL
partitions=1/1
rollup: lemc_terminal_user_info
tabletRatio=8/8
tabletList=237751916,237751920,237751924,237751928,237751932,237751936,237751940,237751944
cardinality=4854916
avgRowSize=3.0
numNodes=0

PLAN FRAGMENT 5
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 17
HASH_PARTITIONED: 590: user_id

16:OlapScanNode
TABLE: lemc_im_session_flow
PREAGGREGATION: ON
PREDICATES: 608: create_time >= '2022-06-07 14:26:22'
partitions=1/1
rollup: lemc_im_session_flow
tabletRatio=12/12
tabletList=143355052,143355056,143355060,143355064,143355068,143355072,143355076,143355080,143355084,143355088 …
cardinality=3436951
avgRowSize=63.29679
numNodes=0

PLAN FRAGMENT 6
OUTPUT EXPRS:
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 15
HASH_PARTITIONED: 433: telephone2

14:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
|
13:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 566: area_id = 574: area_id
|
|----12:EXCHANGE
|
9:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 566> : 566: area_id
|
8:HASH JOIN
| join op: LEFT OUTER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 540: organ_id = 562: organ_id
|
|----7:EXCHANGE
|
5:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 540> : 540: organ_id
|
4:HASH JOIN
| join op: LEFT OUTER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 427: emp_no = 520: user_code
|
|----3:EXCHANGE
|
1:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 427> : 427: emp_no

The package built with the two PRs merged in: was it provided by the SR side?

No, I built the package myself.

Could you switch back to the standard 2.5.7 package first and run the query again?

We can't do that. One of your colleagues is already helping look into it.

Please also collect the costs plan and the query_dump:

  1. explain costs sql
  2. For how to get the query dump, see: https://docs.starrocks.io/zh-cn/latest/faq/Dump_query#keywords
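
As a rough illustration of step 2, the Python sketch below builds (but does not send) the HTTP request for a query dump. It assumes the FE exposes a POST endpoint at /api/query_dump?db=<database> with HTTP basic auth, which is how I recall the linked doc describing it; please confirm against the doc for your version. The helper name, host, port, database, and credentials are all hypothetical placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_query_dump_request(fe_host: str, fe_http_port: int, database: str,
                             user: str, password: str, sql: str) -> urllib.request.Request:
    """Build a POST request carrying the SQL text to the (assumed) query_dump endpoint."""
    url = "http://{}:{}/api/query_dump?db={}".format(
        fe_host, fe_http_port, urllib.parse.quote(database))
    # HTTP basic auth: base64 of "user:password".
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=sql.encode("utf-8"),
        headers={"Authorization": "Basic " + token},
        method="POST",
    )


# Placeholder values; nothing is sent until urlopen() is called.
req = build_query_dump_request("fe-host", 8030, "mydb", "root", "", "select 1")
print(req.get_method(), req.full_url)
# To actually fetch the dump (requires a reachable FE):
# with urllib.request.urlopen(req) as resp:
#     open("dump_file", "wb").write(resp.read())
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before running it against a production FE.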

Which two PRs did you merge into the FE? Please send me the address of your fork.

2.5.7 https://github.com/StarRocks/starrocks/pull/26358 fixes the FE failing to start when upgrading from 2.3.11 to 2.5.7
2.5.7 https://github.com/StarRocks/starrocks/pull/29432 fixes a database deadlock

I didn't fork the project; I just cherry-picked the two PRs locally.

dump_file (36.0 KB)