starrocks 聚合查询存在重复key

【详述】聚合查询会存在 重复分组,排除分组字符存在空白字符问题。修改时间范围可能正常,不会存在重复key。
【背景】执行聚合分析SQL,如下:

SELECT
	first_prop_dim AS keyword,
	SUM (count) AS clause_cnt
FROM
	(
	SELECT
		first_prop_dim,
		emotion_term,
		count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS "count"
	FROM
		view_comment_clause clause
	WHERE
		clause.goods_id IN ('684005614244')
		AND clause.first_prop_dim NOT IN (
                        '消费者特征', 
                        '购买行为', 
                        '使用情况', 
                        '品牌印象', 
                        '回购意愿', 
                        '推荐意愿', 
                        '整体评价', 
                        '产品对比', 
                        '其他维度'
                )
		AND (clause.comment_time BETWEEN '2023-03-01' AND '2023-03-31')
		AND first_prop_dim IS NOT NULL
	GROUP BY
		first_prop_dim
		,emotion_term
) arr
GROUP BY
	first_prop_dim
;

【业务影响】SQL 聚合查询结果不对。
【StarRocks版本 2.5.2】
【集群规模】1fe+3be(fe与be混部)
【机器信息】CPU虚拟核/内存:8C/64G
【联系方式】421231244@qq.com
【附件】
查询结果存在重复的分组如下:
image
排除空白字符问题:


修改时间范围:结果确正常了。

您的意思是条件 (clause.comment_time BETWEEN ‘2023-03-01’ AND ‘2023-03-31’)时,应该显示产品,物流,营销,客户4个。不应该显示那么多行,是这个意思吗?

是的,我是聚合分组查询,不应该存在分组相同的两行。

SELECT
first_prop_dim,
emotion_term,
count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS “count”
FROM
view_comment_clause clause
WHERE
clause.goods_id IN (‘684005614244’)
AND clause.first_prop_dim NOT IN (
‘消费者特征’,
‘购买行为’,
‘使用情况’,
‘品牌印象’,
‘回购意愿’,
‘推荐意愿’,
‘整体评价’,
‘产品对比’,
‘其他维度’
)
AND (clause.comment_time BETWEEN ‘2023-03-01’ AND ‘2023-03-31’)
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim
,emotion_term

SELECT
first_prop_dim,
count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS “count”
FROM
view_comment_clause clause
WHERE
clause.goods_id IN (‘684005614244’)
AND clause.first_prop_dim NOT IN (
‘消费者特征’,
‘购买行为’,
‘使用情况’,
‘品牌印象’,
‘回购意愿’,
‘推荐意愿’,
‘整体评价’,
‘产品对比’,
‘其他维度’
)
AND (clause.comment_time BETWEEN ‘2023-03-01’ AND ‘2023-03-31’)
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim

这两个SQL执行结果看一下。

执行结果分别如下:
image

请问存在重复key是指什么?

SELECT
first_prop_dim AS keyword,
SUM (count) AS clause_cnt
FROM
(
SELECT
first_prop_dim,
emotion_term,
count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS “count”
FROM
view_comment_clause clause
WHERE
clause.goods_id IN (‘684005614244’)
AND clause.first_prop_dim NOT IN (
‘消费者特征’,
‘购买行为’,
‘使用情况’,
‘品牌印象’,
‘回购意愿’,
‘推荐意愿’,
‘整体评价’,
‘产品对比’,
‘其他维度’
)
AND (clause.comment_time BETWEEN ‘2023-03-01’ AND ‘2023-03-31’)
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim
,emotion_term
) arr
GROUP BY
first_prop_dim
;
这个结果执行一下。

SELECT
arr.first_prop_dim AS keyword,
SUM (arr.count) AS clause_cnt
FROM
(
SELECT
first_prop_dim,
emotion_term,
count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS “count”
FROM
view_comment_clause clause
WHERE
clause.goods_id IN (‘684005614244’)
AND clause.first_prop_dim NOT IN (
‘消费者特征’,
‘购买行为’,
‘使用情况’,
‘品牌印象’,
‘回购意愿’,
‘推荐意愿’,
‘整体评价’,
‘产品对比’,
‘其他维度’
)
AND (clause.comment_time BETWEEN ‘2023-03-01’ AND ‘2023-03-31’)
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim
,emotion_term
) arr
GROUP BY
arr.first_prop_dim
;
这个也试一下

执行如下SQL:


结果如下:
image
改变查询条件,又会正常,比如改了查询时间范围。

不知道有没有明白我的意思?

ok,我们排查一下。

麻烦提供这两个SQL的explain costs,谢谢!

EXPLAIN
SELECT
first_prop_dim AS keyword,
SUM (COUNT) AS clause_cnt
FROM
(
SELECT
first_prop_dim,
emotion_term,
COUNT(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS 'count'
FROM
view_comment_clause clause
WHERE
clause.goods_id IN ('684005614244')
AND clause.first_prop_dim NOT IN (
'消费者特征',
'购买行为',
'使用情况',
'品牌印象',
'回购意愿',
'推荐意愿',
'整体评价',
'产品对比',
'其他维度'
)
AND (clause.comment_time BETWEEN '2023-03-01' AND '2023-03-31')
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim
,emotion_term
) arr
GROUP BY
first_prop_dim
;

Explain String                                                                                                                                                                                                                                                                                      
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PLAN FRAGMENT 0                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:5: first_prop_dim | 27: sum                                                                                                                                                                                                                                                           
  PARTITION: HASH_PARTITIONED: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                    
  RESULT SINK                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                    
  7:AGGREGATE (update finalize)                                                                                                                                                                                                                                                                     
  |  output: sum(26: count)                                                                                                                                                                                                                                                                         
  |  group by: 5: first_prop_dim                                                                                                                                                                                                                                                                    
  |                                                                                                                                                                                                                                                                                                 
  6:Project                                                                                                                                                                                                                                                                                         
  |  <slot 5> : 5: first_prop_dim                                                                                                                                                                                                                                                                   
  |  <slot 26> : 26: count                                                                                                                                                                                                                                                                          
  |                                                                                                                                                                                                                                                                                                 
  5:AGGREGATE (update finalize)                                                                                                                                                                                                                                                                     
  |  output: count(if(2: comment_id IS NULL, NULL, if(12: emotion_term IS NULL, NULL, 8: terminal_prop_dim)))                                                                                                                                                                                       
  |  group by: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                                  
  |                                                                                                                                                                                                                                                                                                 
  4:AGGREGATE (merge serialize)                                                                                                                                                                                                                                                                     
  |  group by: 2: comment_id, 5: first_prop_dim, 8: terminal_prop_dim, 12: emotion_term                                                                                                                                                                                                             
  |                                                                                                                                                                                                                                                                                                 
  3:EXCHANGE                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                    
PLAN FRAGMENT 1                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:                                                                                                                                                                                                                                                                                      
  PARTITION: RANDOM                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                    
  STREAM DATA SINK                                                                                                                                                                                                                                                                                  
    EXCHANGE ID: 03                                                                                                                                                                                                                                                                                 
    HASH_PARTITIONED: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                    
  2:AGGREGATE (update serialize)                                                                                                                                                                                                                                                                    
  |  STREAMING                                                                                                                                                                                                                                                                                      
  |  group by: 2: comment_id, 5: first_prop_dim, 8: terminal_prop_dim, 12: emotion_term                                                                                                                                                                                                             
  |                                                                                                                                                                                                                                                                                                 
  1:Project                                                                                                                                                                                                                                                                                         
  |  <slot 2> : 2: comment_id                                                                                                                                                                                                                                                                       
  |  <slot 5> : 5: first_prop_dim                                                                                                                                                                                                                                                                   
  |  <slot 8> : 8: terminal_prop_dim                                                                                                                                                                                                                                                                
  |  <slot 12> : 12: emotion_term                                                                                                                                                                                                                                                                   
  |                                                                                                                                                                                                                                                                                                 
  0:OlapScanNode                                                                                                                                                                                                                                                                                    
     TABLE: view_comment_clause                                                                                                                                                                                                                                                                     
     PREAGGREGATION: ON                                                                                                                                                                                                                                                                             
     PREDICATES: 1: goods_id = 684005614244, 5: first_prop_dim NOT IN ('消费者特征', '购买行为', '使用情况', '品牌印象', '回购意愿', '推荐意愿', '整体评价', '产品对比', '其他维度'), 3: comment_time <= '2023-03-31 00:00:00', 5: first_prop_dim IS NOT NULL  
     partitions=1/108                                                                                                                                                                                                                                                                               
     rollup: view_comment_clause                                                                                                                                                                                                                                                                    
     tabletRatio=1/6                                                                                                                                                                                                                                                                                
     tabletList=704033                                                                                                                                                                                                                                                                              
     cardinality=2619980                                                                                                                                                                                                                                                                            
     avgRowSize=6.0                                                                                                                                                                                                                                                                                 
     numNodes=0
EXPLAIN
SELECT
first_prop_dim AS keyword,
SUM (COUNT) AS clause_cnt
FROM
(
SELECT
first_prop_dim,
emotion_term,
COUNT(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS 'count'
FROM
view_comment_clause clause
WHERE
clause.goods_id IN ('684005614244')
AND clause.first_prop_dim NOT IN (
'消费者特征',
'购买行为',
'使用情况',
'品牌印象',
'回购意愿',
'推荐意愿',
'整体评价',
'产品对比',
'其他维度'
)
AND (clause.comment_time BETWEEN '2023-01-01' AND '2023-03-31')
AND first_prop_dim IS NOT NULL
GROUP BY
first_prop_dim
,emotion_term
) arr
GROUP BY
first_prop_dim
;
Explain String                                                                                                                                                                                                                                                                                      
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PLAN FRAGMENT 0                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:5: first_prop_dim | 27: sum                                                                                                                                                                                                                                                           
  PARTITION: UNPARTITIONED                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                    
  RESULT SINK                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                    
  10:EXCHANGE                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                    
PLAN FRAGMENT 1                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:                                                                                                                                                                                                                                                                                      
  PARTITION: HASH_PARTITIONED: 5: first_prop_dim                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                    
  STREAM DATA SINK                                                                                                                                                                                                                                                                                  
    EXCHANGE ID: 10                                                                                                                                                                                                                                                                                 
    UNPARTITIONED                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                    
  9:AGGREGATE (merge finalize)                                                                                                                                                                                                                                                                      
  |  output: sum(27: sum)                                                                                                                                                                                                                                                                           
  |  group by: 5: first_prop_dim                                                                                                                                                                                                                                                                    
  |                                                                                                                                                                                                                                                                                                 
  8:EXCHANGE                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                    
PLAN FRAGMENT 2                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:                                                                                                                                                                                                                                                                                      
  PARTITION: HASH_PARTITIONED: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                    
  STREAM DATA SINK                                                                                                                                                                                                                                                                                  
    EXCHANGE ID: 08                                                                                                                                                                                                                                                                                 
    HASH_PARTITIONED: 5: first_prop_dim                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                    
  7:AGGREGATE (update serialize)                                                                                                                                                                                                                                                                    
  |  STREAMING                                                                                                                                                                                                                                                                                      
  |  output: sum(26: count)                                                                                                                                                                                                                                                                         
  |  group by: 5: first_prop_dim                                                                                                                                                                                                                                                                    
  |                                                                                                                                                                                                                                                                                                 
  6:Project                                                                                                                                                                                                                                                                                         
  |  <slot 5> : 5: first_prop_dim                                                                                                                                                                                                                                                                   
  |  <slot 26> : 26: count                                                                                                                                                                                                                                                                          
  |                                                                                                                                                                                                                                                                                                 
  5:AGGREGATE (update finalize)                                                                                                                                                                                                                                                                     
  |  output: count(if(2: comment_id IS NULL, NULL, if(12: emotion_term IS NULL, NULL, 8: terminal_prop_dim)))                                                                                                                                                                                       
  |  group by: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                                  
  |                                                                                                                                                                                                                                                                                                 
  4:AGGREGATE (merge serialize)                                                                                                                                                                                                                                                                     
  |  group by: 2: comment_id, 5: first_prop_dim, 8: terminal_prop_dim, 12: emotion_term                                                                                                                                                                                                             
  |                                                                                                                                                                                                                                                                                                 
  3:EXCHANGE                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                    
PLAN FRAGMENT 3                                                                                                                                                                                                                                                                                     
 OUTPUT EXPRS:                                                                                                                                                                                                                                                                                      
  PARTITION: RANDOM                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                    
  STREAM DATA SINK                                                                                                                                                                                                                                                                                  
    EXCHANGE ID: 03                                                                                                                                                                                                                                                                                 
    HASH_PARTITIONED: 5: first_prop_dim, 12: emotion_term                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                    
  2:AGGREGATE (update serialize)                                                                                                                                                                                                                                                                    
  |  STREAMING                                                                                                                                                                                                                                                                                      
  |  group by: 2: comment_id, 5: first_prop_dim, 8: terminal_prop_dim, 12: emotion_term                                                                                                                                                                                                             
  |                                                                                                                                                                                                                                                                                                 
  1:Project                                                                                                                                                                                                                                                                                         
  |  <slot 2> : 2: comment_id                                                                                                                                                                                                                                                                       
  |  <slot 5> : 5: first_prop_dim                                                                                                                                                                                                                                                                   
  |  <slot 8> : 8: terminal_prop_dim                                                                                                                                                                                                                                                                
  |  <slot 12> : 12: emotion_term                                                                                                                                                                                                                                                                   
  |                                                                                                                                                                                                                                                                                                 
  0:OlapScanNode                                                                                                                                                                                                                                                                                    
     TABLE: view_comment_clause                                                                                                                                                                                                                                                                     
     PREAGGREGATION: ON                                                                                                                                                                                                                                                                             
     PREDICATES: 1: goods_id = 684005614244, 5: first_prop_dim NOT IN ('消费者特征', '购买行为', '使用情况', '品牌印象', '回购意愿', '推荐意愿', '整体评价', '产品对比', '其他维度'), 3: comment_time <= '2023-03-31 00:00:00', 5: first_prop_dim IS NOT NULL  
     partitions=3/108                                                                                                                                                                                                                                                                               
     rollup: view_comment_clause                                                                                                                                                                                                                                                                    
     tabletRatio=3/18                                                                                                                                                                                                                                                                               
     tabletList=703985,704009,704033                                                                                                                                                                                                                                                                
     cardinality=7960315                                                                                                                                                                                                                                                                            
     avgRowSize=6.0                                                                                                                                                                                                                                                                                 
     numNodes=0

现在还能复现吗?辛苦再发下两个 SQL 的如下信息吧:

  • 执行的 profile
  • explain costs SQL

《profile 分析》

这个问题有排查出来什么问题嘛?

排查出来了,是一个已知问题,预计 2.5.5 修复。
查询只命中一个 tablet 时会有问题,针对这个 SQL 设置 set new_planner_agg_stage = 4 可以暂时绕过这个问题:

SELECT /*+ SET_VAR(new_planner_agg_stage=4) */
	first_prop_dim AS keyword,
	SUM (count) AS clause_cnt
FROM 
-- ...

set new_planner_agg_stage = 4;
SELECT
	first_prop_dim AS keyword,
	SUM (count) AS clause_cnt
FROM
	(
	SELECT
		first_prop_dim,
		emotion_term,
		count(DISTINCT clause.comment_id, emotion_term, terminal_prop_dim) AS "count"
	FROM
		view_comment_clause clause
	WHERE
		clause.goods_id IN ('647149341429')
		-- category_name = '面膜'
		AND clause.first_prop_dim NOT IN (
                        '消费者特征', 
                        '购买行为', 
                        '使用情况', 
                        '品牌印象', 
                        '回购意愿', 
                        '推荐意愿', 
                        '整体评价', 
                        '产品对比', 
                        '其他维度'
                )
		AND (clause.comment_time BETWEEN '2023-03-01' AND '2023-03-31')
		AND first_prop_dim IS NOT NULL
	GROUP BY
		first_prop_dim,
		emotion_term
) arr
GROUP BY
	first_prop_dim
ORDER BY
	clause_cnt DESC
;

类似这样去执行SQL?

是的,只是针对这个 SQL 设置 new_planner_agg_stage=4,其他 SQL 依然保持 new_planner_agg_stage=0 即可。