StarRocks primary key table query times out: how to repair a partition of a primary key table that cannot be queried

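To help us locate your problem faster, please provide the following information. Thank you.
[Details] The table uses the primary key model. After a period of real-time writes, the latest partition can no longer be queried, and queries fail with: query timeout. backend id: 11129. How can this be fixed?
[Background]
[Business impact]
[StarRocks version] 3.1.0
[Cluster size] e.g. 3fe+6be
[Machine info] 16C/64G/10 Gigabit network
[Attachments] fe.warn log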
WARN (AutoStatistic|25) [Coordinator.deliverExecBatchFragmentsRequests():1123] exec plan fragment failed, errmsg=wait_for_version timeout(56002ms) version:3280 tablet:791182 #version:1 [3028 3028@0 3028] pending:3030,3031,3032,3033,3034,3035,3036,3037,3038,3039,3040,3041,3042,3043,3044,3045,3046,3047,3048,3049,3050,3051,3052,3053,3054,3055,3056,3057,3058,3059,3060,3061,3062,3063,3064,3065,3066,3067,3068,3069,3070,3071,3072,3073,3074,3075,3076,3077,3078,3079,3080,3081,3082,3083,3084,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3100,3101,3102,3103,3104,3105,3106,3107,3108,3109,3110,3111,3112,3113,3114,3115,3116,3117,3118,3119,3120,3121,3122,3123,3124,3125,3126,3127,3128,3129,3130,3131,3132,3133,3134,3135,3136,3137,3138,3139,3140,3141,3142,3143,3144,3145,3146,3147,3148,3149,3150,3151,3152,3153,3154,3155,3156,3157,3158,3159,3160,3161,3162,3163,3164,3165,3166,3167,3168,3169,3170,3171,3172,3173,3174,3175,3176,3177,3178,3179,3180,3181,3182,3183,3184,3185,3186,3187,3188,3189,3190,3191,3192,3193,3194,3195,3196,3197,3198,3199,3200,3201,3202,3203,3204,3205,3206,3207,3208,3209,3210,3211,3212,3213,3214,3215,3216,3217,3218,3219,3220,3221,3222,3223,3224,3225,3226,3227,3228,3229,3230,3231,3232,3233,3234,3235,3236,3237,3238,3239,3240,3241,3242,3243,3244,3245,3246,3247,3248,3249,3250,3251,3252,3253,3254,3255,3256,3257,3258,3259,3260,3261,3262,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280, rowsets:6[id/seg/row/del/byte/compaction]: [0/1/725462/1808/39.70 MB/216.79 MB],[1/1/802/232/49.95 KB/256.02 MB],[2/1/892/429/52.79 KB/256.07 MB],[3/1/508/341/34.55 KB/256.08 MB],[4/1/695/275/44.14 KB/256.04 MB],[5/1/946/0/56.44 KB/255.94 MB], code: TIMEOUT, fragmentId=F00

WARN (AutoStatistic|25) [StatisticExecutor.collectStatistics():250] Collect statistics error
com.starrocks.sql.analyzer.SemanticException: Getting analyzing error. Detail message: Statistics query fail | Error Message [INTERNAL_ERROR] | QueryId [52541757-6c20-11ee-be8f-00163e16d5da] | SQL [SELECT cast(4 as INT), cast(762744 as BIGINT), 'id', cast(COUNT(1) as BIGINT), cast(COUNT(1) * 8 as BIGINT), hex(hll_serialize(IFNULL(hll_raw(id), hll_empty()))), cast(COUNT(1) - COUNT(id) as BIGINT), IFNULL(MAX(id), ''), IFNULL(MIN(id), '') FROM test.tablename partition p20231016].
at com.starrocks.statistic.StatisticExecutor.executeDQL(StatisticExecutor.java:313) ~[starrocks-fe.jar:?]
at com.starrocks.statistic.StatisticExecutor.executeStatisticDQL(StatisticExecutor.java:295) ~[starrocks-fe.jar:?]
at com.starrocks.statistic.FullStatisticsCollectJob.collectStatisticSync(FullStatisticsCollectJob.java:122) ~[starrocks-fe.jar:?]
at com.starrocks.statistic.FullStatisticsCollectJob.collect(FullStatisticsCollectJob.java:102) ~[starrocks-fe.jar:?]
at com.starrocks.statistic.StatisticExecutor.collectStatistics(StatisticExecutor.java:248) ~[starrocks-fe.jar:?]
at com.starrocks.statistic.StatisticAutoCollector.runAfterCatalogReady(StatisticAutoCollector.java:87) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.FrontendDaemon.runOneCycle(FrontendDaemon.java:72) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]

Could anyone help with this? The tablet status looks normal, but the table cannot be queried.

Are queries against the other partitions affected? Is the cluster status normal? Has the cluster been restarted recently?

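Only one partition cannot be queried; all other partitions can be queried normally. The cluster status is normal and the cluster has not been restarted.
From the logs, it looks like the versions of this table are no longer being merged. A day later the same warning still appears.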

As a first step, you can reload the data for that partition to recover it. The version is 3.1.0, right?

Yes, 3.1.0. So for now the only option is to reload the data? Is there no command that can repair this partition?

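I ran into a similar problem on 2.5.12: SQL against several primary key tables all timed out. After stopping the service on the node that reported the timeout, the data could be queried normally.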



I also hit this earlier on 2.5.5 and posted about it in the thread "Large numbers of SELECT and INSERT statements hang and time out".

Do you have the specific error logs?

How should this problem be solved? I am also using primary key tables with frequent updates, and then all the BEs went down.

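Which write method are you using? Flink transactional writes? Were any ALTER jobs running at the same time? Judging from the error, the cause is that version 3029 is missing. Could you try upgrading to the latest version?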
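For anyone reading a similar `wait_for_version timeout` line, here is a minimal Python sketch of how the gap can be spotted. It assumes that the first number inside the square brackets (`[3028 3028@0 3028]`) is the latest applied version and that `pending:` lists versions waiting to be applied; that reading is inferred from the log in this thread, not from StarRocks documentation. The function name `find_version_gap` is made up for illustration.

```python
import re


def find_version_gap(log_line: str):
    """Return (applied, pending, missing) parsed from a wait_for_version log line.

    Assumption: the first number in "[a b@c d]" is the latest applied version.
    Returns None when the line does not have the expected shape.
    """
    applied_match = re.search(r"\[(\d+) \d+@\d+ \d+\]", log_line)
    pending_match = re.search(r"pending:([\d,]+)", log_line)
    if not applied_match or not pending_match:
        return None

    applied = int(applied_match.group(1))
    # The pending list ends with a trailing comma in the log, so skip empty items.
    pending = sorted(int(v) for v in pending_match.group(1).split(",") if v)
    if not pending:
        return None

    # Versions between the applied one and the first pending one never arrived.
    missing = list(range(applied + 1, pending[0]))
    return applied, pending, missing


# Shortened copy of the log line from the first post (pending list truncated).
sample = (
    "wait_for_version timeout(56002ms) version:3280 tablet:791182 "
    "#version:1 [3028 3028@0 3028] pending:3030,3031,3032, rowsets:6"
)
print(find_version_gap(sample))
```

A non-empty `missing` list suggests that the pending versions cannot be applied until the missing one is published, which would match the symptom of queries waiting until timeout.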

If you still have the logs, please share them.

@U_1669452058044_6098
Could you provide the BE and FE logs from that time?

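Has this been resolved? I am now running into the same problem.
It also reproduces reliably: primary key table, written with partial column updates.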

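Hello, has this been resolved? I am now running into the same problem.
It also reproduces reliably: primary key table, written with partial column updates. Current version: 2.5.22.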

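Could you provide the steps to reproduce? Also, when you say it cannot be queried, does the query return an error or just no data? Could you attach a screenshot?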

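Our production cluster now runs into this every day as well: tablets of several primary key tables report errors, as in the screenshot below. Queries against the affected table then time out. After manually running ADMIN SET REPLICA STATUS PROPERTIES() to set the status of the tablet on that BE node to bad, things return to normal. Our cluster is on 2.5.11, 3fe+7be. These primary key tables are almost all written in real time by Flink or Spark Streaming.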
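For reference, the workaround above can be sketched as a small Python helper that only builds the statement text; running it is left to whatever MySQL-protocol client you use. The helper name `build_set_replica_bad` is hypothetical, and the IDs in the example are taken from the first post in this thread purely for illustration, not from this poster's cluster. Marking a replica as bad may cause it to be dropped and re-cloned, so it is probably only safe when other healthy replicas of the tablet exist.

```python
def build_set_replica_bad(tablet_id: int, backend_id: int) -> str:
    """Build an ADMIN SET REPLICA STATUS statement that marks one replica as bad.

    Only the SQL text is produced here; nothing is sent to the cluster.
    """
    return (
        "ADMIN SET REPLICA STATUS PROPERTIES("
        f'"tablet_id" = "{tablet_id}", '
        f'"backend_id" = "{backend_id}", '
        '"status" = "bad");'
    )


# Illustrative IDs from the log in the first post (tablet 791182, backend 11129).
print(build_set_replica_bad(791182, 11129))
```

Keeping the statement generation separate from execution makes it easy to review the exact tablet and backend IDs before anything is changed on the cluster.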

This is usually caused by publish being slow. You need to search the logs for the context around the tablet that reported the error.
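One way to pull that context out of a large log is sketched below in Python, roughly what `grep -C` does. The function name `tablet_context` and the sample lines are made up for illustration; in practice you would read the lines from the BE log file on the node that reported the timeout. The match is a plain substring search, so an ID that happens to appear inside a longer number would also match.

```python
def tablet_context(lines, tablet_id, before=2, after=2):
    """Return log lines that mention tablet_id, plus surrounding context lines."""
    needle = str(tablet_id)
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            keep.update(range(start, end))
    # Preserve the original order and avoid duplicating overlapping windows.
    return [lines[i] for i in sorted(keep)]


# Made-up log lines, only to show the shape of the result.
sample_lines = [
    "line a",
    "line b",
    "publish version slow tablet:791182",
    "line c",
    "line d",
    "line e",
]
for line in tablet_context(sample_lines, 791182, before=1, after=1):
    print(line)
```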