StarRocks v3.1.3: materialized view refresh causes a BE to crash and restart

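To help us locate your problem faster, please provide the following information. Thank you.
[Details] Refreshing a materialized view on StarRocks v3.1.3 fails with the following error: com.starrocks.common.UserException: [E110]Fail to read from Socket{id=679 fd=605 addr=10.218.2.226:8060:63256} (0x0x7f99154fed40): Connection timed out [R1][E112]Not connected to 10.218.2.226:8060 yet, server_id=679 [R2][E112]Not connected to 10.218.2.226:8060 yet, server_id=679 [R3][E112]Not connected to 10.218.2.226:8060 yet, server_id=679. After investigating, we found that the BE at 10.218.2.226 was in the DEAD state.
[Background] While stress-testing asynchronous materialized view refresh with a large data volume on v3.1.3, a BE service crashed and restarted once.
[Business impact]
[StarRocks version] e.g. 1.18.2
[Cluster size] e.g. 3 FE (1 follower + 2 observer) + 8 BE
[Machine info] CPU virtual cores / memory / NIC, e.g. 48C/64G/10 GbE
[Attachments]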
select * from information_schema.task_runs where query_id = 'e1b6d607-64b8-11ee-a15b-bc97e157f93e'

226_be_10_26-11_05.log (645.6 KB)
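
The DEAD state mentioned in the details above can be checked from a MySQL client connected to the FE. This is only a minimal sketch of that check, not the exact procedure used in this report:

```sql
-- List all BE nodes. The Alive, LastStartTime and ErrMsg columns indicate
-- whether a node is currently down or was restarted recently.
SHOW BACKENDS;
```

A LastStartTime that is later than the failed refresh would be consistent with the BE having restarted during the refresh.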

Could you please share the CREATE statement for the materialized view and the data volume of the base table?

CREATE MATERIALIZED VIEW cdp_event_model_tangtest0929 (code, event_time, superid, mainid, mainid_type, etl_task_id, dt, ts, created_at, updated_at, task_id, event, touch_point, event_body, _brand)
PARTITION BY (date_trunc('month', event_time))
DISTRIBUTED BY HASH(superid) BUCKETS 8
REFRESH MANUAL
PROPERTIES ( "replicated_storage" = "true", "replication_num" = "3", "storage_medium" = "HDD" )
AS SELECT cdp_event_model.code, cdp_event_model.event_time, cdp_event_model.superid, cdp_event_model.mainid, cdp_event_model.mainid_type, cdp_event_model.etl_task_id, cdp_event_model.dt, cdp_event_model.ts, cdp_event_model.created_at, cdp_event_model.updated_at, cdp_event_model.task_id, cdp_event_model.event, cdp_event_model.touch_point, cdp_event_model.event_body, cdp_event_model._brand
FROM cdp_demo.cdp_event_model
WHERE cdp_event_model.event = 'tangtest0929';

The base table has 592,000,049 rows (about 590 million); 66,000,000 rows (about 66 million) match the WHERE condition.

@dongquan The BE crash and restart happened only once. After that, the following error was reported: java.lang.RuntimeException: create partitions failed: Table creation timed out. You can increase the timeout by increasing the config "tablet_create_timeout_second" and try again. To increase the config "tablet_create_timeout_second" (currently 10), run the following command: admin set frontend config("tablet_create_timeout_second"="20") or add the following configuration to the fe.conf file and restart the process: tablet_create_timeout_second=20. After we raised tablet_create_timeout_second to 60, the error no longer appeared. Could the problem be related to tablet_create_timeout_second having been left at its default value?
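
For reference, a minimal sketch of inspecting and raising this setting at runtime, based on the command quoted in the error message. The value 60 is the one reported above, not a recommendation. As the error message suggests, the setting would also need to be added to fe.conf to survive an FE restart.

```sql
-- Show the current value (the error message reports it as 10 seconds).
ADMIN SHOW FRONTEND CONFIG LIKE "tablet_create_timeout_second";

-- Raise the timeout on the running FE; 60 is the value reported above.
ADMIN SET FRONTEND CONFIG ("tablet_create_timeout_second" = "60");
```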

Is there any new progress on the investigation?

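All of our storage is SSD. When creating the materialized view we did not explicitly specify "storage_medium" = "HDD", yet "storage_medium" = "HDD" was added by default. Does this have any effect? This is the BE configuration.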

@trueeyu @dongquan

Hasn't this issue been resolved?
I'm also on SSD and hit the same problem. How was it resolved? Any help would be appreciated.