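After every BE (backend) node core-dumped, none of the nodes can be started again.

To help locate the problem faster, here is the requested information. Thanks.
[Details] Materialized view refresh. I specified: REFRESH MATERIALIZED VIEW mv_material_participant_hour_report PARTITION START ("2025-01-01") END ("2025-01-03") WITH SYNC MODE;
The refresh actually started from 2024-12-01.
It eventually refreshed through the 2025-01-02 data and reported success, but every node core-dumped and the load average rose above 700.
I manually killed the ccpp-abrt process twice.
Yesterday the BEs restarted and recovered on their own after a core dump. Today the automatic restart did not succeed.
To keep core dumps from driving the system load too high, I set these two parameters: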
echo "0" > /proc/sys/kernel/core_uses_pid
echo "" > /proc/sys/kernel/core_pattern
After that, the BEs could no longer be started manually.
Error log: be.out (69.0 KB)
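
For reference, a small shell sketch of how the two kernel settings above interact, based on my reading of core(5): with an empty core_pattern and core_uses_pid set to 0 the kernel writes no core file, while a pattern starting with | pipes the dump to a helper program (presumably how the ccpp-abrt processes were being launched on this host, though I have not confirmed that). The function name core_dump_state is only illustrative.

```shell
# Classify what the kernel does with a core dump, given the values of
# /proc/sys/kernel/core_pattern ($1) and /proc/sys/kernel/core_uses_pid ($2).
core_dump_state() {
    pattern="$1"
    uses_pid="$2"
    case "$pattern" in
        "")
            # Empty pattern: no core file when core_uses_pid is 0,
            # otherwise a hidden file named ".<pid>" is written.
            if [ "$uses_pid" = "0" ]; then echo "disabled"; else echo "file"; fi ;;
        "|"*) echo "piped" ;;   # dump is piped to a user-space helper
        *)    echo "file" ;;    # dump is written using the pattern as file name
    esac
}

# On a Linux host, report the current state (skipped if /proc is unavailable).
if [ -r /proc/sys/kernel/core_pattern ] && [ -r /proc/sys/kernel/core_uses_pid ]; then
    core_dump_state "$(cat /proc/sys/kernel/core_pattern)" "$(cat /proc/sys/kernel/core_uses_pid)"
fi
```

After the two echo commands above, this should print "disabled". Values written to /proc this way do not survive a reboot.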

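After discussing with the community, I was advised to add these parameters to the BE configuration:
enable_pk_size_tiered_compaction_strategy=false
update_compaction_num_threads_per_disk=0
After that the BEs could start.
However, they then crashed one after another. I started them again and they came up.
Refreshing again brought all of them down once more:
REFRESH MATERIALIZED VIEW mv_material_participant_hour_report PARTITION START ("2025-01-03") END ("2025-01-06") WITH SYNC MODE;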

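[Background]
[Business impact]
[Shared-data (storage-compute separation)?]
[StarRocks version] e.g. 3.3.8
[Cluster size] e.g. 3 FE (1 follower + 2 observer) + 3 BE (FE and BE co-located)
[Machine info] CPU virtual cores/memory/NIC, e.g. 48C/64G/10 GbE
[Contact] Community group 4, 何明(明天) 银狐, or email heming@emar.com. Thanks.
[Attachments] be.out (103.4 KB) be.WARNING.log.20250115-162915 (3.4 MB) dmesg.txt (1.1 MB) messages (92 KB)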

CREATE MATERIALIZED VIEW mv_material_participant_hour_report
PARTITION BY (Report_Date)
REFRESH ASYNC START("2025-01-21 18:05:00") EVERY(INTERVAL 10 MINUTE) AS
SELECT
Report_Date,temp.Upload_User_Id AS User_id,role_id,temp.Material_Type,temp.agent_material_id,temp.Platform_Type,temp.Account_Id,HOUR,temp.spec_id,Spec_Name,temp.Customer_Id,temp.Customer_Name,Product_Id,Product_Name,temp.Account_Name,temp.Project_Id,temp.Project_Name,
temp.Spec_Material_Id,temp.media_material_id,IFNULL(temp.Material_Id,bind.material_id) Material_Id,IFNULL(temp.material_name,bind.media_material_name) material_name,Material_Original_Name,IFNULL(b.signature,bind.media_material_md5) AS signature,
yfci.id AS Industry_Id,yfci.industry_name AS Industry_Name,
eu.display_name AS User_Name,editor,editor_name,director,director_name,shoter,shoter_name,
IFNULL(b.img_path,IFNULL(bind.binding_material_url,bind.media_material_url)) AS img_path,b.video_pre_path AS video_pre_path,
Role_Name,IFNULL(pdei.main_dept,ul.dept_name) AS Dept_Name,IFNULL(pdei.main_dept_id,ul.dept_id) AS Dept_Id,Order_Type,script_id,spec_schedule_id,
temp.material_cost,temp.cost,conversion,impression,click,
video_play_count,video_outer_play_count,video_outer_play100_count,deep_conversions_count,page_phone_call_direct_count,apply_pv,page_reservation_count,order_pv,page_consult_count,credit_pv,
video_avg_play_time,video_play_time,Material_Create_Time,creative_count,b.media_material_tag
FROM mv_material_participant_hour_report_01 temp
LEFT JOIN tidb.mbg_core.emarbox_user eu ON temp.Upload_User_Id = eu.user_id
LEFT JOIN tidb.mbg_core.pig_dd_employee_info pdei ON eu.login_name = pdei.dsp_account_no
LEFT JOIN (SELECT stat_date,user_id,MAX(dept_id) dept_id,MAX(dept_name) dept_name FROM tidb.mbg_core.pig_dd_user_dept_info_daily GROUP BY stat_date,user_id ) ul ON ul.user_id=pdei.user_id AND temp.report_date=ul.stat_date
LEFT JOIN tidb.mbg_core.yxt_finance_customer yfc ON temp.Customer_Id = yfc.id
LEFT JOIN tidb.mbg_core.yxt_finance_customer_industry yfci ON yfc.customer_industry_id = yfci.id
LEFT JOIN tidb.mbg_business.agent_material_binding_202101 bind
ON bind.agent_material_id=temp.agent_material_id AND bind.media_id=temp.platform_type AND bind.material_type=temp.material_type AND temp.account_id=bind.account_id
LEFT JOIN (
SELECT 0 AS material_type,image_id AS m_id,img_mark_path AS img_path,NULL AS video_pre_path,img_md5 AS signature FROM tidb.mbg_business.creative_material_image a
UNION ALL
SELECT 1 AS material_type,video_id AS m_id,video_mark_path AS img_path,video_pre_path,video_md5 AS signature FROM tidb.mbg_business.creative_material_video a) b
ON IFNULL(temp.material_id,bind.material_id) = b.m_id AND temp.Material_Type=b.material_type
LEFT JOIN (
SELECT 4 AS Platform_Type,material_id,COUNT(1) AS creative_count FROM tidb.yixintui_operate.creative_material_tt_experience GROUP BY material_id
UNION ALL
SELECT 2 AS Platform_Type,material_id,COUNT(1) AS creative_count FROM tidb.yixintui_operate.creative_material_gdt_v3 GROUP BY material_id
UNION ALL
SELECT 5 AS Platform_Type,photo_id AS material_id,COUNT(1) AS creative_count FROM tidb.yixintui_operate.Synads_Ks_Creative GROUP BY photo_id
UNION ALL
SELECT 6 AS Platform_Type,material_id,COUNT(1) AS creative_count FROM tidb.yixintui_operate.creative_material_qc GROUP BY material_id
) AS cma ON temp.Platform_Type = cma.Platform_Type AND temp.agent_material_id = cma.material_id
LEFT JOIN tidb.yixintui_operate.board_design_material_tag_int b
ON temp.spec_material_id = b.spec_material_id ;

The fast (non-forced) refresh failed again, so the only option left is a FORCE refresh, and a full FORCE refresh may trigger the core dump again.
Refresh materialized view mv_material_participant_hour_report failed after retrying 1 times(try-lock 0 times), error-msg : java.lang.IllegalStateException: corrupted partition meta
at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
at com.starrocks.connector.partitiontraits.DefaultTraits.getPartitionNameWithPartitionInfo(DefaultTraits.java:117)
at com.starrocks.connector.partitiontraits.DefaultTraits.getUpdatedPartitionNames(DefaultTraits.java:134)
at com.starrocks.connector.partitiontraits.CachedPartitionTraits.lambda$getUpdatedPartitionNames$13(CachedPartitionTraits.java:184)
at com.starrocks.connector.partitiontraits.CachedPartitionTraits.lambda$getCache$0(CachedPartitionTraits.java:89)
at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2406)
at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2404)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
at com.starrocks.connector.partitiontraits.CachedPartitionTraits.getCache(CachedPartitionTraits.java:89)
at com.starrocks.connector.partitiontraits.CachedPartitionTraits.getUpdatedPartitionNames(CachedPartitionTraits.java:184)
at com.starrocks.catalog.MaterializedView.getUpdatedPartitionNamesOfExternalTable(MaterializedView.java:828)
at com.starrocks.catalog.MvRefreshArbiter.needsToRefreshTable(MvRefreshArbiter.java:145)
at com.starrocks.catalog.MvRefreshArbiter.needsToRefreshTable(MvRefreshArbiter.java:47)
at com.starrocks.scheduler.mv.MVPCTRefreshPartitioner.needsRefreshBasedOnNonRefTables(MVPCTRefreshPartitioner.java:227)
at com.starrocks.scheduler.mv.MVPCTRefreshRangePartitioner.getMVPartitionsToRefresh(MVPCTRefreshRangePartitioner.java:193)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.getPartitionsToRefreshForMaterializedView(PartitionBasedMvRefreshProcessor.java:972)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.getPartitionsToRefreshForMaterializedView(PartitionBasedMvRefreshProcessor.java:930)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.checkMvToRefreshedPartitions(PartitionBasedMvRefreshProcessor.java:289)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.doRefreshMaterializedView(PartitionBasedMvRefreshProcessor.java:414)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.doRefreshMaterializedViewWithRetry(PartitionBasedMvRefreshProcessor.java:368)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.doMvRefresh(PartitionBasedMvRefreshProcessor.java:327)
at com.starrocks.scheduler.PartitionBasedMvRefreshProcessor.processTaskRun(PartitionBasedMvRefreshProcessor.java:199)
at com.starrocks.scheduler.TaskRun.executeTaskRun(TaskRun.java:270)
at com.starrocks.scheduler.TaskRunExecutor.lambda$executeTaskRun$0(TaskRunExecutor.java:59)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

Query: REFRESH MATERIALIZED VIEW mv_material_participant_hour_report force WITH SYNC MODE

Error code: 1064
execute task mv-145413 failed: Refresh materialized view mv_material_participant_hour_report failed after retrying 1 times(try-lock 0 times), error-msg : com.starrocks.sql.analyzer.SemanticException: Getting analyzing error. Detail message: Tablet lost replicas. Check if any backend is down or not. tablet_id: 147434, replicas: 10170:2/-1/2/0:NORMAL:ALIVE,10171:2/-1/2/0:NORMAL:DEAD,10002:2/-1/2/0:NORMAL:DEAD,. Check quorum number failed(OlapTableSink): BeReplicaSize:1, quorum:2.
at com.starrocks.sql.Insert
Only one backend is alive; the other two are dead.
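
The quorum failure in the message above can be reproduced by hand. This is a minimal shell sketch, assuming the default majority write quorum (replica count / 2 + 1, integer division) and assuming each entry in the replicas: list ends in :ALIVE or :DEAD as it does in this log. quorum_check is a made-up helper name, not a StarRocks tool.

```shell
# Replica list copied from the error message for tablet 147434.
replicas='10170:2/-1/2/0:NORMAL:ALIVE,10171:2/-1/2/0:NORMAL:DEAD,10002:2/-1/2/0:NORMAL:DEAD,'

# Print replica counts and return 0 only if the ALIVE replicas reach a majority.
quorum_check() {
    printf '%s' "$1" | awk -F',' '{
        for (i = 1; i <= NF; i++) {
            if ($i == "") continue          # skip the empty field after the trailing comma
            total++
            if ($i ~ /:ALIVE$/) alive++
        }
    } END {
        quorum = int(total / 2) + 1         # assumed majority quorum
        printf "total=%d alive=%d quorum=%d\n", total, alive, quorum
        if (alive >= quorum) exit 0
        exit 1
    }'
}

# Prints: total=3 alive=1 quorum=2, followed by the message below.
quorum_check "$replicas" || echo "not enough live replicas for a write"
```

This matches BeReplicaSize:1, quorum:2 in the error: with three replicas and two BEs down, the INSERT behind the refresh cannot reach a majority until at least one more BE is back.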

REFRESH MATERIALIZED VIEW mv_material_participant_hour_report FORCE WITH SYNC MODE;
The forced refresh succeeded.
I expected it to hang, but it refreshed only the remaining partitions that had not finished yesterday.

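REFRESH MATERIALIZED VIEW mv_material_participant_hour_report PARTITION START ("2025-01-03") END ("2025-01-04") WITH SYNC MODE;
Why can't this kind of command be executed? After the full refresh above finished, I tried refreshing a single partition, and it failed:
Error code: 1064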
execute task mv-145413 failed: Refresh materialized view mv_material_participant_hour_report failed after retrying 1 times(try-lock 0 times), error-msg : java.lang.IllegalStateException: corrupted partition meta
at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
at com.starrocks.connector.partitiontraits.DefaultTraits.getPartitionNameWithPartitionInfo(DefaultTraits.java:117)
at com.starrocks.connector.partitiontraits.DefaultTraits.getUpdatedPartitionNames(DefaultTraits.java:134)

REFRESH MATERIALIZED VIEW mv_material_participant_hour_report WITH SYNC MODE; fails with the same error.
Two more BEs died.

SHOW tablet 147434

SHOW PROC '/dbs/-1/-1/partitions/-1/-1/147434';
Error code: 1064
Getting analyzing error. Detail message: Unknown proc node path: /dbs/-1/-1/partitions/-1/-1/147434. msg: Unknown database id or name "-1".

The machines do not support AVX2, so this is a self-compiled build.

Query: REFRESH MATERIALIZED VIEW mv_material_participant_hour_report_02 PARTITION START ("2025-01-01") END ("2025-01-03") WITH SYNC MOD…

Error code: 1064
java.net.SocketTimeoutException: Read timed out

The timeout was reported after a total of 5 minutes.