Queries fail after loading a large amount of data

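To help us locate your problem faster, please provide the following information. Thank you.
【Details】After loading tens of millions of rows into one partition of a table via Stream Load, queries may fail with the errors below.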

Error 1: ERROR 1064 (HY000): invalid encoding type:8
Error 2: ERROR 1064 (HY000): num of element information corrupted, _num_element_after_padding:262721024, _num_elements:8
Error 3: [42000][1064] Bad page: checksum mismatch (actual=2952600861 vs expect=1277960), file=/data/starRocks/storage/data/365/15004/2129610567/0200000000382bb9cd41910b6e0b0d512336bca4efb70dad_0.dat
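【Background】A large amount of data was inserted into one partition
【Business impact】Queries on a single partition fail
【Shared-data (storage-compute separation)】No
【StarRocks version】3.3.0
【Cluster size】3 FE (1 follower + 2 observers) + 3 BE (FE and BE co-located)
【Machine info】16C/64G/10 GbE
【Table model】Primary Key
【Load or export method】Stream Load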
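【Contact】So that we can reach you promptly for logs while working on the problem, please add your contact details, for example: community group 4, Xiao Li, or an email address. Thank you.
【Attachments】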
pipeline_driver_executor.cpp:168] [Driver] Process error, query_id=94baf21d-530b-11ef-a917-02421cd749fe, instance_id=94baf21d-530b-11ef-a917-02421cd74a06, status=Corruption: Bad page: checksum mismatch (actual=2952600861 vs expect=1277960), file=/data/starRocks/storage/data/365/15004/2129610567/0200000000382bb9cd41910b6e0b0d512336bca4efb70dad_0.dat
be/src/storage/rowset/scalar_column_iterator.cpp:359 _reader->read_page(_opts, iter.page(), &handle, &page_body, &footer)
be/src/storage/rowset/scalar_column_iterator.cpp:165 _read_data_page(_page_iter)
be/src/storage/rowset/scalar_column_iterator.cpp:584 seek_to_ordinal(*rowids)
be/src/storage/rowset/segment_iterator.cpp:1716 _column_decoders[cid].decode_values_by_rowid(*ordinals, col.get())
be/src/storage/rowset/segment_iterator.cpp:1166 _finish_late_materialization(_context)
be/src/storage/tablet_reader.cpp:258 _collect_iter->get_next(chunk)
be/src/exec/pipeline/scan/olap_chunk_source.cpp:514 _prj_iter->get_next(chunk)
be/src/exec/pipeline/scan/scan_operator.cpp:250 _get_scan_status()
W0805 17:17:47.895830 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a05 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a05 driver=driver_10_0, status=INPUT_EMPTY, operator-chain: [exchange_source_10_0x7fcc260a6b90(O) -> hash_join_build_11_0x7fcc260fdb90(X)(HashJoiner=0x7fcc260e9a10)] cancels operator hash_join_build_11_0x7fcc260fdb90(X)(HashJoiner=0x7fcc260e9a10) with finished error runtime state is cancelled
W0805 17:17:47.895866 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a05 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a05 driver=driver_10_1, status=INPUT_EMPTY, operator-chain: [exchange_source_10_0x7fcc260fef90(O) -> hash_join_build_11_0x7fcc260ffc10(X)(HashJoiner=0x7fcc260ff210)] cancels operator hash_join_build_11_0x7fcc260ffc10(X)(HashJoiner=0x7fcc260ff210) with finished error runtime state is cancelled
W0805 17:17:47.895879 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a05 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a05 driver=driver_10_2, status=INPUT_EMPTY, operator-chain: [exchange_source_10_0x7fcc26100110(O) -> hash_join_build_11_0x7fcc26100b10(X)(HashJoiner=0x7fcc26100890)] cancels operator hash_join_build_11_0x7fcc26100b10(X)(HashJoiner=0x7fcc26100890) with finished error runtime state is cancelled
W0805 17:17:47.895889 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a05 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a05 driver=driver_10_3, status=INPUT_EMPTY, operator-chain: [exchange_source_10_0x7fcc2619c510(O) -> hash_join_build_11_0x7fcc2619d410(X)(HashJoiner=0x7fcc2619cc90)] cancels operator hash_join_build_11_0x7fcc2619d410(X)(HashJoiner=0x7fcc2619cc90) with finished error runtime state is cancelled
W0805 17:17:47.895903 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a05 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a05 driver=driver_10_4, status=INPUT_EMPTY, operator-chain: [exchange_source_10_0x7fcc2619e590(X) -> hash_join_build_11_0x7fcc2619f710(X)(HashJoiner=0x7fcc2619ea90)] cancels operator hash_join_build_11_0x7fcc2619f710(X)(HashJoiner=0x7fcc2619ea90) with finished error runtime state is cancelled
W0805 17:17:47.897559 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_13_0, status=INPUT_EMPTY, operator-chain: [exchange_source_13_0x7fcada710410(O) -> hash_join_build_14_0x7fcada712710(X)(HashJoiner=0x7fcada711a90)] cancels operator hash_join_build_14_0x7fcada712710(X)(HashJoiner=0x7fcada711a90) with finished error runtime state is cancelled
W0805 17:17:47.897600 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_13_1, status=INPUT_EMPTY, operator-chain: [exchange_source_13_0x7fcada713110(O) -> hash_join_build_14_0x7fcada8d4a10(X)(HashJoiner=0x7fcada713d90)] cancels operator hash_join_build_14_0x7fcada8d4a10(X)(HashJoiner=0x7fcada713d90) with finished error runtime state is cancelled
W0805 17:17:47.897632 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_13_2, status=INPUT_EMPTY, operator-chain: [exchange_source_13_0x7fcada8d6f90(O) -> hash_join_build_14_0x7fcada327410(X)(HashJoiner=0x7fcada8d8390)] cancels operator hash_join_build_14_0x7fcada327410(X)(HashJoiner=0x7fcada8d8390) with finished error runtime state is cancelled
W0805 17:17:47.897648 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_13_3, status=INPUT_EMPTY, operator-chain: [exchange_source_13_0x7fcada329490(O) -> hash_join_build_14_0x7fcada463190(X)(HashJoiner=0x7fcada462f10)] cancels operator hash_join_build_14_0x7fcada463190(X)(HashJoiner=0x7fcada462f10) with finished error runtime state is cancelled
W0805 17:17:47.897678 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_13_4, status=INPUT_EMPTY, operator-chain: [exchange_source_13_0x7fcada463690(X) -> hash_join_build_14_0x7fcada463e10(X)(HashJoiner=0x7fcada463910)] cancels operator hash_join_build_14_0x7fcada463e10(X)(HashJoiner=0x7fcada463910) with finished error runtime state is cancelled
W0805 17:17:47.898622 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_16_10, status=INPUT_EMPTY, operator-chain: [aggregate_blocking_source_16_0x7fcada172a10(O) -> project_17_0x7fcada97db10(X) -> local_sort_sink_18_0x7fcada172c90(X)] cancels operator local_sort_sink_18_0x7fcada172c90(X) with finished error runtime state is cancelled
W0805 17:17:47.898662 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_16_11, status=INPUT_EMPTY, operator-chain: [aggregate_blocking_source_16_0x7fcada173690(O) -> project_17_0x7fcada98ec10(X) -> local_sort_sink_18_0x7fcada173b90(X)] cancels operator local_sort_sink_18_0x7fcada173b90(X) with finished error runtime state is cancelled
W0805 17:17:47.898680 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_16_12, status=INPUT_EMPTY, operator-chain: [aggregate_blocking_source_16_0x7fcada174090(O) -> project_17_0x7fcada990d10(X) -> local_sort_sink_18_0x7fcada174310(X)] cancels operator local_sort_sink_18_0x7fcada174310(X) with finished error runtime state is cancelled
W0805 17:17:47.898699 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_16_13, status=INPUT_EMPTY, operator-chain: [aggregate_blocking_source_16_0x7fcada174810(O) -> project_17_0x7fcada99ce10(X) -> local_sort_sink_18_0x7fcada174a90(X)] cancels operator local_sort_sink_18_0x7fcada174a90(X) with finished error runtime state is cancelled
W0805 17:17:47.898723 10901 pipeline_driver.cpp:791] fragment_id 94baf21d-530b-11ef-a917-02421cd74a00 driver query_id=94baf21d-530b-11ef-a917-02421cd749fe fragment_id=94baf21d-530b-11ef-a917-02421cd74a00 driver=driver_16_14, status=INPUT_EMPTY, operator-chain: [aggregate_blocking_source_16_0x7fcada174f90(O) -> project_17_0x7fcada9a2f10(X) -> local_sort_sink_18_0x7fcada175210(X)] cancels operator local_sort_sink_18_0x7fcada175210(X) with finished error runtime state is cancelled
I0805 17:17:47.899969 10875 internal_service.cpp:512] cancel fragment, fragment_instance_id=94baf21d-530b-11ef-a917-02421cd749ff, reason: InternalError
I0805 17:17:47.900004 10872 internal_service.cpp:512] cancel fragment, fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a00, reason: InternalError
I0805 17:17:47.900032 10876 internal_service.cpp:512] cancel fragment, fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a05, reason: InternalError
I0805 17:17:47.900038 10875 internal_service.cpp:512] cancel fragment, fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a06, reason: InternalError
I0805 17:17:47.900104 10878 internal_service.cpp:512] cancel fragment, fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a0a, reason: InternalError
I0805 17:17:47.900486 10899 pipeline_driver_executor.cpp:354] [Driver] Succeed to report exec state: fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a05, is_done=1
I0805 17:17:47.900535 13004 pipeline_driver_executor.cpp:354] [Driver] Succeed to report exec state: fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a00, is_done=1
I0805 17:17:47.900713 10899 pipeline_driver_executor.cpp:343] [Driver] Fail to report exec state due to query not found: fragment_instance_id=94baf21d-530b-11ef-a917-02421cd749ff
I0805 17:17:47.902233 10899 pipeline_driver_executor.cpp:343] [Driver] Fail to report exec state due to query not found: fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a06
I0805 17:17:47.903611 10899 pipeline_driver_executor.cpp:343] [Driver] Fail to report exec state due to query not found: fragment_instance_id=94baf21d-530b-11ef-a917-02421cd74a0a
I0805 17:17:52.709720 10626 daemon.cpp:198] Current memory statistics: process(13850340832), query_pool(0), load(0), metadata(126989144), compaction(0), schema_change(0), column_pool(0), page_cache(10874664896), update(805366), chunk_allocator(2147588400), clone(0), consistency(0), datacache(0), jit(6136)

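Hello, is this the shared-data (storage-compute separated) version? Is the cache enabled?

Shared-nothing (storage and compute together), and the cache is not enabled.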

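Here is roughly the background:
Data is loaded through the Stream Load interface in batches of 10,000 rows, with about three or four threads loading concurrently. The total is a little over 10 million rows.
On the first load, when the target partition is still empty, loading works and queries return normally.
On later loads, when the amount of data being loaded is close to what the target partition already holds, lines like "can not find key:xxx" appear in the BE log, and after that the data in this partition can no longer be queried. Sometimes one BE is affected, sometimes all three.
For now I can work around it by setting the affected tablet's status to bad and letting it repair automatically. But the problem occurs far too often; I hit it almost every day.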

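I have a similar problem.
Shared-nothing version. With page_cache enabled, it happens occasionally when querying while writing. With page_cache disabled, the problem does not occur.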

OK, thanks. I'll try disabling page_cache.

With page_cache turned off it indeed no longer reproduces. Thanks.

Could you check how large this file is: /data/starRocks/storage/data/365/15004/2129610567/0200000000382bb9cd41910b6e0b0d512336bca4efb70dad_0.dat

Could you post the exact error?

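My error is the same as the original poster's, but on a different version: I am on 3.1.8 plus my own change to the write mode (added support for replace_if_not_null). In my case the error occurs whenever the following 3 conditions hold at the same time:

  1. Writes (both inserts and updates), each writing 40,000 rows * 5 KB of data
  2. Continuously reading the most recent part of the data
  3. page_cache enabled; a size of 1 GB, 2 GB, or 20 GB all trigger it

After running for a while, the error always shows up.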

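We'll try to reproduce it first.

I no longer have the failing environment :nauseated_face:

Are you using Alibaba Cloud EMR, or the official release?

The official release.

Shall I build you a Jemalloc debug version to help reproduce it? I still haven't been able to reproduce it locally.

Can you still reproduce it?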

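Judging from how page reads and writes are implemented in the Segment, the page checksum mismatch error is unrelated to whether the BE's page_cache is enabled. However, enabling the BE's page_cache uses more system memory, which squeezes the operating system's page cache and in turn affects file read/write performance.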
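To make the "Bad page: checksum mismatch (actual=... vs expect=...)" message concrete, here is a minimal sketch of checksum verification on page read. It is not the StarRocks implementation: the `Page` struct, `write_page`, and `verify_page` are hypothetical, and plain CRC-32 (IEEE) is used as a stand-in for whatever checksum the BE really stores. The point is only that the error means the bytes handed to the reader differ from the bytes that were checksummed at write time.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC-32 (IEEE polynomial, reflected form). Assumption: a stand-in
// for the checksum actually used by the storage engine.
inline uint32_t crc32_ieee(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // The mask is all ones when the low bit is set, otherwise zero.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical page: a body plus the checksum recorded when it was written.
struct Page {
    std::string body;
    uint32_t stored_checksum;
};

inline Page write_page(const std::string& body) {
    return Page{body, crc32_ieee(body)};
}

// Returns an empty string when the page is intact, otherwise a message
// shaped like the one in the BE log.
inline std::string verify_page(const Page& page) {
    const uint32_t actual = crc32_ieee(page.body);
    if (actual == page.stored_checksum) return "";
    return "Bad page: checksum mismatch (actual=" + std::to_string(actual) +
           " vs expect=" + std::to_string(page.stored_checksum) + ")";
}
```

In this model, a page whose body is changed after `write_page` (or that is paired with a checksum from a different version of the file) fails `verify_page`, regardless of how the bytes reached the reader.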

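I suspect the problem is related to a conflict that arises when a write thread and a read thread in the BE access the same file at the same time.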

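My code (which adds replace_if_not_null support) hit a similar problem. After investigating, I found that fetching the uncommitted OrdinalIndex before the rewrite populated the page_cache. As a result, the OrdinalIndex held in page_cache no longer matches the actual .dat file, which in turn causes the several related errors.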
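The failure mode described here can be sketched with a toy model. Everything below is hypothetical (the `Disk` map, the `PageCache` class, the file name, and the 16-byte "index"); it only illustrates how a cache keyed by file name and offset serves stale bytes once an uncommitted file is read through the cache and then rewritten.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Toy "disk": file name -> file contents.
using Disk = std::map<std::string, std::string>;

// Toy page cache keyed by (file name, offset).
class PageCache {
public:
    bool lookup(const std::string& file, std::size_t offset, std::string* out) const {
        auto it = pages_.find(std::make_pair(file, offset));
        if (it == pages_.end()) return false;
        *out = it->second;
        return true;
    }
    void insert(const std::string& file, std::size_t offset, const std::string& page) {
        pages_[std::make_pair(file, offset)] = page;
    }

private:
    std::map<std::pair<std::string, std::size_t>, std::string> pages_;
};

// Reads `len` bytes at `offset`, always going through the cache.
inline std::string read_page(const Disk& disk, PageCache& cache, const std::string& file,
                             std::size_t offset, std::size_t len) {
    std::string page;
    if (cache.lookup(file, offset, &page)) return page;
    page = disk.at(file).substr(offset, len);
    cache.insert(file, offset, page);
    return page;
}

// Returns true when a reader sees stale bytes after the file was rewritten.
inline bool stale_read_after_rewrite() {
    Disk disk;
    PageCache cache;
    const std::string file = "segment_0.dat";

    disk[file] = "ordinal-index-v1";
    // The rewrite step peeks at the uncommitted file, which fills the cache.
    read_page(disk, cache, file, 0, 16);
    // The rewrite then replaces the contents under the same name.
    disk[file] = "ordinal-index-v2";
    // A later read hits the cache and gets the old bytes.
    return read_page(disk, cache, file, 0, 16) != disk[file];
}
```

In this model `stale_read_after_rewrite()` returns true: the cached entry still holds the v1 bytes while the file holds v2, which is the kind of cache/file disagreement the post describes.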

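The same problem exists in the rewrite logic for auto-increment primary keys with partial updates. The root cause is that the newer ScalarColumnIterator does not honor the use_page_cache value passed in through opt by the caller and instead checks disable_storage_page_cache directly, so every read goes through page_cache.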

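The PR fixes this problem; you may want to take a look, since your case could be similar. Because the separate per-index cache parameters such as enable_ordinal_index_memory_page_cache are a special case, I did not replace config::disable_storage_page_cache with use_page_cache. Instead I added a temporary_data flag that marks whether the data belongs to an uncommitted, temporary .dat file.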
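A minimal sketch of that idea, under stated assumptions: the names `disable_storage_page_cache` and `temporary_data` come from the post, but the `ReadOptions` struct, `should_use_page_cache`, `read_page`, and the toy disk and cache are hypothetical and are not taken from the PR. Reads flagged as temporary data bypass the cache in both directions, so nothing from an uncommitted file is ever cached.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

namespace config {
// Global switch, named after the one mentioned in the thread.
bool disable_storage_page_cache = false;
}  // namespace config

// Hypothetical per-read options.
struct ReadOptions {
    bool temporary_data = false;  // true for uncommitted files that may be rewritten
};

inline bool should_use_page_cache(const ReadOptions& opts) {
    return !config::disable_storage_page_cache && !opts.temporary_data;
}

using Disk = std::map<std::string, std::string>;
using PageCache = std::map<std::pair<std::string, std::size_t>, std::string>;

inline std::string read_page(const Disk& disk, PageCache& cache, const ReadOptions& opts,
                             const std::string& file, std::size_t offset, std::size_t len) {
    const auto key = std::make_pair(file, offset);
    const bool use_cache = should_use_page_cache(opts);
    if (use_cache) {
        auto it = cache.find(key);
        if (it != cache.end()) return it->second;
    }
    std::string page = disk.at(file).substr(offset, len);
    if (use_cache) cache[key] = page;  // temporary data is never inserted
    return page;
}

// Returns true when the read after the rewrite sees the current file contents.
inline bool fresh_read_after_rewrite() {
    Disk disk;
    PageCache cache;
    const std::string file = "segment_0.dat";

    disk[file] = "ordinal-index-v1";
    ReadOptions rewrite_opts;
    rewrite_opts.temporary_data = true;  // uncommitted file: skip the cache
    read_page(disk, cache, rewrite_opts, file, 0, 16);

    disk[file] = "ordinal-index-v2";     // rewrite replaces the contents
    return read_page(disk, cache, ReadOptions{}, file, 0, 16) == disk[file];
}
```

Keeping the global switch and adding a separate flag, rather than reusing a single `use_page_cache` option, matches the motivation given in the post: other cache-related settings keep their meaning, and only reads of uncommitted files opt out.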


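I have already captured a core file locally and am analyzing it. I'll take a look at your implementation; one moment.

Could you add me on WeChat so we can discuss this PR?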