StarRocks的 broker load导入性能测试
--2021-11-25 刘春雷
1、前言
StarRocks目前我们已经使用了一阵了,由于公司的架构问题,我们之前没有打通HDFS 的broker load 的导入流程。此处全是辛酸泪~
近期终于打通了HDFS 的broker load 流程,就进行了相关导入性能的测试。
2、测试结果汇总
1.1、测试结果
【结论】:
- broker load 导入速度比stream load 快
- BE的节点数越多,broker load的速度越快
- 文件格式对broker load的速度影响不大(此处对比csv与parquet)
- broker 服务推荐部署在BE节点,可以减少网络传输
- 如果导入的文件比较少,broker 服务的数量对broker load的速度影响不大,但推荐所有BE均部署broker
- FE:load_parallel_instance_num 等参数会影响broker load的速度,但不宜调整过大,会导致冲突严重
- BE:flush_thread_num_per_store 与 olap_table_sink_send_interval_ms 参数会影响broker load的速度,但影响有限
- BE服务器CPU核数对导入速度有影响,核数越多越快
- BE服务器CPU主频对导入速度有影响,主频数越多越快
- 本次测试最快172w条/s
【总结】:
- 为提高broker load的速度: HDFS的文件格式均可,设置FE、BE相关参数,增加BE节点数,broker 与 BE混合部署,且BE均部署broker 服务,提高单个BE的配置:cpu核数、主频
【测试详细结果如下】:
1.2、集群信息
1.3、HDFS导入架构
【公司的HDFS】: 有IP白名单限制
3个单独的物理机器已经开通了HDFS的IP白名单
【DBA组内的HDFS】: 同样有IP白名单限制
但是FE/BE/Broker的IP均已经开通
1.4、测试的建表SQL
CREATE TABLE xxx
(
slot
varchar(64) NULL COMMENT “”,
platform
int(11) NULL COMMENT “”,
ad_local1
varchar(20) NULL COMMENT “”,
ad_cate2
varchar(20) NULL COMMENT “”,
ad_id
bigint(20) NULL COMMENT “”,
parthasinfo
bigint(20) NULL COMMENT “”,
sloc1
varchar(20) NULL COMMENT “”,
scate2
varchar(20) NULL COMMENT “”,
sellingcitytag
varchar(20) NULL COMMENT “”,
requesttype
varchar(20) NULL COMMENT “”,
ad_viptag
int(11) NULL COMMENT “”,
action_type
varchar(32) NULL COMMENT “”,
data_time
datetime NULL COMMENT “”
) ENGINE=OLAP
DUPLICATE KEY(slot
, platform
, ad_local1
, ad_cate2
)
COMMENT “测试表”
PARTITION BY RANGE(data_time
)
(
PARTITION p20211103 VALUES [(‘2021-11-03 00:00:00’), (‘2021-11-04 00:00:00’)
),
DISTRIBUTED BY HASH(ad_id
, slot
) BUCKETS 48
PROPERTIES (
“replication_num” = “3”,
“dynamic_partition.enable” = “true”,
“dynamic_partition.time_unit” = “DAY”,
“dynamic_partition.time_zone” = “Asia/Shanghai”,
“dynamic_partition.start” = “-30”,
“dynamic_partition.end” = “1”,
“dynamic_partition.prefix” = “p”,
“dynamic_partition.buckets” = “48”,
“in_memory” = “false”,
“storage_format” = “DEFAULT”
);
1.5、HDFS准备文件
【公司的】:
总条数:226229635
总大小:3G
总文件数:12个
文件格式: parquet
单个文件:256M
【公司的】:
总条数:226229635
总大小:19G
总文件数:25个
文件格式: csv
单个文件:782M
【DBA组内的】:
总条数:226229635
总大小:19G
总文件数:25个
文件格式: csv
单个文件:782M
2、DBA的工具导入
2.1、实现方式
公司HDFS–DBA程序实现web方式下载–循环顺序 stream load 上传到StarRocks
2.2、时间情况
下载耗时 :203s
stream load耗时 :712s
行数:226229635
导入速度:226229635/712= 317738 条/s
3、broker load 导入
3.1、broker导入任务-1
【信息】:
集群:A
公司HDFS parquet格式
1个独立的broker
FE相关参数默认
BE数量:6个
【 创建导入任务 】 :
LOAD LABEL xx_202111091625
(
DATA INFILE(“viewfs://xxx/xxx/xxx/xxx/xxx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “3600”
);
【任务结果】:
条数:226229635
速度: 457954 条/s
3.2、broker导入任务-2
【信息】:
集群:A
公司HDFS csv格式
1个独立的broker
FE相关参数默认
BE数量:6个
【 创建导入任务 】 :
LOAD LABEL xx_2_202111121045
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/temp/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);
【任务结果】:
条数:226229635
速度: 424445 条/s
3.3、 broker导入任务-3
【信息】:
集群:A
公司HDFS csv格式
3个独立的broker
FE参数:
load_parallel_instance_num = 16
max_broker_concurrency = 100 默认
BE数量:6个
【 创建导入任务 】
LOAD LABEL xx_2ck_2_202111131100
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/x/xxx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);
【任务结果】:
条数:226229635
速度: 557215 条/s
3.4、 broker导入任务-4
【信息】:
集群:A
公司HDFS csv格式
3个独立的broker
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
BE数量:6个
【 创建导入任务 】
LOAD LABEL xx_202111131417
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/xx/xx_test_csv/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);
【任务结果】:
条数:226229635
速度: 590677 条/s
3.5、broker导入任务-5
【信息】:
集群:A
组内的HDFS csv格式
FE,BE均混合部署broker,共12个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
BE数量:6个
【创建导入任务】
LOAD LABEL xx_202111141354
(
DATA INFILE(“hdfs://xxx:xx/StarRocks/xxx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”
);
【任务结果】:
条数:226229635
速度: 586087 条/s
3.6、broker导入任务-6
【信息】:
集群:A
组内的 HDFS csv格式
FE,BE均混合部署broker,共12个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
BE数量:6个
【 创建导入任务 】
LOAD LABEL xx_202111141454
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);
【任务结果】:
条数:226229635
速度: 608144 条/s
3.7、 broker导入任务-7
【信息】:
集群:A
组内的 HDFS csv格式
BE均混合部署broker,共6个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_2ck_2_202111150918
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 589139 条/s
3.8、 broker导入任务-8
【信息】:
集群:A
组内的 HDFS csv格式
BE均混合部署broker,共6个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx2_202111161526
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 562760 条/s
3.9、 broker导入任务-9
【信息】:
集群:A
组内的HDFS csv格式
BE均混合部署broker,共6个
FE参数:
load_parallel_instance_num = 2
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【创建导入任务】
LOAD LABEL xx_2_202111161640
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 509526 条/s
3.10、 broker导入任务-10
【信息】:
集群:B
组内的HDFS csv格式
FE均混合部署broker,共3个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【创建导入任务】
LOAD LABEL xx_202111181741
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 501617 条/s
3.11、 broker导入任务-11
【信息】:
集群:B
组内的 HDFS csv格式
2个BE均混合部署broker,共2个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_202111191628
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 844140 条/s
3.12、 broker导入任务-12
【信息】:
集群:B
组内的 HDFS csv格式
6个BE均混合部署broker,共6个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:6个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_2_202111191647
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 883709 条/s
3.13、 broker导入任务-13
【信息】:
集群:B
组内的 HDFS csv格式
6个BE混合部署broker,共6个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:9个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_202111231755
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 1508197 条/s
3.14、 broker导入任务-14
【信息】:
集群:B
组内的 HDFS csv格式
9个BE混合部署broker,共9个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:9个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_202111231806
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 1459546 条/s
3.15、 broker导入任务-15
【信息】:
集群:B
组内的 HDFS csv格式
9个BE混合部署broker,共9个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:11个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABELxx202111231824
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 1753718 条/s
3.16、 broker导入任务-16
【信息】:
集群:B
组内的 HDFS csv格式
11个BE混合部署broker,共11个
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:11个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_2_202111231830
(
DATA INFILE(“hdfs://xxx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 1726943 条/s
3.17、 broker导入任务-17
【信息】:
集群:B
公司的 HDFS parquet 格式
3个独立部署的broker
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:11个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_202111251452
(
DATA INFILE(“viewfs://xxx/xx/xx/xx/xx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “hdp_lbg_supin”
)
PROPERTIES
(
“timeout” = “3600”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 1405153 条/s
3.18、 broker导入任务-18
【信息】:
集群:C
公司的 HDFS parquet 格式
3个独立部署的broker
FE参数:
load_parallel_instance_num = 32
max_broker_concurrency = 400
be参数:
flush_thread_num_per_store=8
olap_table_sink_send_interval_ms=0
BE数量:3个
导入任务参数:
“load_mem_limit” = “12884901888” (12G)
【 创建导入任务 】
LOAD LABEL xx_202111251649
(
DATA INFILE(“viewfs://xxx/xx/xx/xxx/xx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”
)
PROPERTIES
(
“timeout” = “3600”,
“load_mem_limit” = “12884901888”
);
【任务结果】:
条数:226229635
速度: 526115 条/s