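Version: 2.5.7
Running a SQL statement on the leader node crashes every BE node in the cluster; running it on a follower node fails with a "build chunk meta error" error.
The cluster crashes when the hic_t_cc_rec table in the SQL is a partitioned table; querying the same table without partitioning works fine. The related tables and the query are attached below. With the tables created from this script and left empty, the problem did not reproduce, so treat the script as a reference for the table structure only.
crash.sql (25.9 KB)
be(1).out (25.3 KB)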
Run show backends; and show frontends; and confirm that the versions are consistent.
Please also post the explain costs output for the SQL that crashes the BEs.
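To make the version check concrete, here is a minimal Python sketch that compares the version strings from the two result sets. It assumes the rows from show backends and show frontends have already been fetched as dictionaries and that both expose `IP` and `Version` columns; the column names, helper names, and sample values are illustrative assumptions, not taken from this cluster.

```python
from collections import defaultdict


def group_nodes_by_version(rows, host_column="IP", version_column="Version"):
    """Map each version string to the list of hosts reporting it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[version_column]].append(row[host_column])
    return dict(groups)


def versions_consistent(*row_sets):
    """Return True when every node in every result set reports the same version."""
    versions = set()
    for rows in row_sets:
        versions.update(group_nodes_by_version(rows))
    return len(versions) <= 1


if __name__ == "__main__":
    # Hypothetical rows, shaped like the output of the two SHOW statements.
    backends = [
        {"IP": "10.0.0.1", "Version": "2.5.7-abc"},
        {"IP": "10.0.0.2", "Version": "2.5.7-abc"},
    ]
    frontends = [{"IP": "10.0.0.3", "Version": "2.5.7-abc"}]
    print(group_nodes_by_version(backends + frontends))
    print(versions_consistent(backends, frontends))
```

If the check returns False, the grouping shows which hosts are on the odd version, which is the first thing to rule out before digging into the plan.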
PLAN FRAGMENT 0
OUTPUT EXPRS:425: row_id | 435: wo_id | 437: product_id | 455: fault_desc | 433: telephone2 | 637: date | 515: callchannel | 591: session_id | 608: create_time | 590: user_id | 589: account_id | 620: ter_user_phone | 636: row_number()
PARTITION: UNPARTITIONED
RESULT SINK
30:EXCHANGE
limit: 200
PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 425: row_id
STREAM DATA SINK
EXCHANGE ID: 30
UNPARTITIONED
29:Project
| <slot 425> : 425: row_id
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 589> : 589: account_id
| <slot 590> : 590: user_id
| <slot 591> : 591: session_id
| <slot 608> : 608: create_time
| <slot 620> : 620: ter_user_phone
| <slot 636> : 636: row_number()
| <slot 637> : date(426: enter_time)
| limit: 200
|
28:SELECT
| predicates: 636: row_number() = 1
| limit: 200
|
27:ANALYTIC
| functions: [, row_number(), ]
| partition by: 425: row_id
| order by: 608: create_time ASC
| window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|
26:SORT
| order by: <slot 425> 425: row_id ASC, <slot 608> 608: create_time ASC
| offset: 0
|
25:EXCHANGE
PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 433: telephone2
STREAM DATA SINK
EXCHANGE ID: 25
HASH_PARTITIONED: 425: row_id
24:PARTITION-TOP-N
| partition by: 425: row_id
| partition limit: 1
| order by: <slot 425> 425: row_id ASC, <slot 608> 608: create_time ASC
| offset: 0
|
23:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 433: telephone2 = 620: ter_user_phone
| other join predicates: 608: create_time > 426: enter_time, days_sub(608: create_time, 7) <= 426: enter_time
|
|----22:EXCHANGE
|
15:EXCHANGE
PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: 590: user_id
STREAM DATA SINK
EXCHANGE ID: 22
HASH_PARTITIONED: 620: ter_user_phone
21:Project
| <slot 589> : 589: account_id
| <slot 590> : 590: user_id
| <slot 591> : 591: session_id
| <slot 608> : 608: create_time
| <slot 620> : 620: ter_user_phone
|
20:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| colocate: false, reason:
| equal join conjunct: 590: user_id = 622: ter_user_id
| equal join conjunct: 589: account_id = 623: terminal_id
|
|----19:EXCHANGE
|
17:EXCHANGE
PLAN FRAGMENT 4
OUTPUT EXPRS:
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 19
HASH_PARTITIONED: 622: ter_user_id
18:OlapScanNode
TABLE: lemc_terminal_user_info
PREAGGREGATION: ON
PREDICATES: 620: ter_user_phone IS NOT NULL
partitions=1/1
rollup: lemc_terminal_user_info
tabletRatio=8/8
tabletList=237751916,237751920,237751924,237751928,237751932,237751936,237751940,237751944
cardinality=4854916
avgRowSize=3.0
numNodes=0
PLAN FRAGMENT 5
OUTPUT EXPRS:
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 17
HASH_PARTITIONED: 590: user_id
16:OlapScanNode
TABLE: lemc_im_session_flow
PREAGGREGATION: ON
PREDICATES: 608: create_time >= '2022-06-07 14:26:22'
partitions=1/1
rollup: lemc_im_session_flow
tabletRatio=12/12
tabletList=143355052,143355056,143355060,143355064,143355068,143355072,143355076,143355080,143355084,143355088 …
cardinality=3436951
avgRowSize=63.29679
numNodes=0
PLAN FRAGMENT 6
OUTPUT EXPRS:
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 15
HASH_PARTITIONED: 433: telephone2
14:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
|
13:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 566: area_id = 574: area_id
|
|----12:EXCHANGE
|
9:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 566> : 566: area_id
|
8:HASH JOIN
| join op: LEFT OUTER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 540: organ_id = 562: organ_id
|
|----7:EXCHANGE
|
5:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 433> : 433: telephone2
| <slot 435> : 435: wo_id
| <slot 437> : 437: product_id
| <slot 455> : 455: fault_desc
| <slot 515> : 515: callchannel
| <slot 540> : 540: organ_id
|
4:HASH JOIN
| join op: LEFT OUTER JOIN (BROADCAST)
| colocate: false, reason:
| equal join conjunct: 427: emp_no = 520: user_code
|
|----3:EXCHANGE
|
1:Project
| <slot 425> : 425: row_id
| <slot 426> : 426: enter_time
| <slot 427> : 427: emp_no
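Was this package, built with the two PRs merged in, provided by the SR side?
No, we built the package ourselves.
Could you switch back to the standard 2.5.7 package first and run the query again?
We can't do that. A colleague is already helping look into it.
Please also grab the costs plan and the query_dump.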
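For anyone unsure how to collect a query_dump, the sketch below builds the HTTP request in Python without sending it. It assumes the FE exposes a `/api/query_dump?db=...` endpoint on its HTTP port that accepts the SQL text as the POST body with basic authentication, as I recall from the StarRocks documentation; verify that against the docs for your version. The host, port, database, and credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_query_dump_request(fe_host, fe_http_port, database, sql, user, password):
    """Build (but do not send) a POST request for the assumed query_dump endpoint."""
    query = urllib.parse.urlencode({"db": database})
    url = f"http://{fe_host}:{fe_http_port}/api/query_dump?{query}"
    # HTTP basic auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=sql.encode("utf-8"),  # the SQL statement is the request body
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace with the real FE address and the crashing SQL.
    req = build_query_dump_request("fe.example", 8030, "demo_db", "SELECT 1", "root", "")
    print(req.get_method(), req.full_url)
    # To actually fetch the dump, pass req to urllib.request.urlopen and save the body.
```

Keeping the request construction separate from sending it makes it easy to inspect the URL and body before running it against a production FE.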
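Which two PRs did you merge into the FE? Please send me the address of your forked project.
2.5.7 https://github.com/StarRocks/starrocks/pull/26358 fixes the FE failing to start when upgrading from 2.3.11 to 2.5.7
2.5.7 https://github.com/StarRocks/starrocks/pull/29432 fixes a database deadlock
I didn't fork the project; I just cherry-picked the two PRs locally.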