Common Crash / Bug / Optimization Lookup

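  1. Primary Key table: after a SchemaChange, queries that hit the prefix index return wrong results
    Adding an index or changing a column type can both trigger this.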
  • Github Issue:
  • Github Fix PR:
  • Jira
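  • Affected versions:
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0
  • Fixed versions:
    • 2.4.6+
    • 2.5.7+
    • 3.0.1+
  • Temporary workaround:
  • Root cause:
    • The SchemaChange did not perform a HeapMerge and only concatenated the files, so the data in the Segment files is not sorted correctly.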
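The root cause above can be illustrated with a toy model. This is a minimal sketch, not StarRocks code: the prefix index is modeled as a plain binary search, and the function names (`concat_runs`, `heap_merge_runs`, `sorted_lookup`) are hypothetical. It shows why concatenating sorted runs, instead of heap-merging them, makes a sorted lookup miss keys that are actually present.

```python
import bisect
import heapq


def concat_runs(runs):
    """Concatenate sorted runs without merging (models the buggy behaviour)."""
    return [key for run in runs for key in run]


def heap_merge_runs(runs):
    """Heap-merge sorted runs into one globally sorted sequence."""
    return list(heapq.merge(*runs))


def sorted_lookup(keys, target):
    """Binary search; only correct when `keys` is globally sorted."""
    i = bisect.bisect_left(keys, target)
    return i < len(keys) and keys[i] == target


runs = [[1, 5, 9], [2, 3, 8]]       # two individually sorted runs
concatenated = concat_runs(runs)    # each run is sorted, the whole is not
merged = heap_merge_runs(runs)      # globally sorted

print(sorted_lookup(merged, 2))        # key is found
print(sorted_lookup(concatenated, 2))  # key exists but the search misses it
```

A full scan of `concatenated` still sees every key, which matches the symptom that only queries going through the prefix index are affected.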
  1. GroupBy tinyint Crash

It can also produce wrong results.

*** Aborted at 1685449309 (unix time) try "date -d @1685449309" if you are using GNU date ***
PC: @          0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
*** SIGSEGV (@0x10) received by PID 14038 (TID 0x7f3361305700) from PID 16; stack trace: ***
    @          0x6240182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f33df4ed630 (unknown)
    @          0x374c255 starrocks::NullableAggregateFunctionUnary<>::update_batch_selectively()
    @          0x34eeb6e starrocks::Aggregator::compute_batch_agg_states_with_selection()
    @          0x311b8ab starrocks::AggregateBlockingNode::open()
    @          0x4f4fe14 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4f5221d starrocks::PlanFragmentExecutor::open()
    @          0x4e9cb2b starrocks::FragmentExecState::execute()
    @          0x4ea3303 starrocks::FragmentMgr::exec_actual()
    @          0x506b022 starrocks::ThreadPool::dispatch_thread()
    @          0x5065b1a starrocks::Thread::supervise_thread()
    @     0x7f33df4e5ea5 start_thread
    @     0x7f33deb00b0d __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.3.0 ~ 2.3.12
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0
  • Fixed versions:
    • 2.3.13+
    • 2.4.6+
    • 2.5.7+
    • 3.0.1+
  • Temporary workaround:
    • cast(tinyint as int)
  • Root cause:
    • Caused by the FixedSizedHashTable optimization
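To check quickly whether a cluster is exposed, the affected ranges listed above can be encoded and compared against the running version. This is a minimal sketch; the helper names (`parse_version`, `is_affected`) are hypothetical, and it assumes plain `major.minor.patch` version strings.

```python
# Affected ranges for the GroupBy tinyint crash, copied from the list above.
AFFECTED_RANGES = [
    ("2.3.0", "2.3.12"),
    ("2.4.0", "2.4.5"),
    ("2.5.0", "2.5.6"),
    ("3.0.0", "3.0.0"),
]


def parse_version(version):
    """Turn '2.5.6' into (2, 5, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version, ranges=AFFECTED_RANGES):
    """Return True if `version` lies inside any inclusive affected range."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi) for lo, hi in ranges)


print(is_affected("2.5.6"))  # inside 2.5.0 ~ 2.5.6
print(is_affected("2.5.7"))  # first fixed 2.5 release
```

Comparing tuples of integers avoids the string-comparison trap where "2.3.9" sorts after "2.3.12".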
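  1. select * and select count(*) return inconsistent results

This is usually caused by incorrectly sorted data in the Segment files, so lookups through the prefix index return wrong results.

Query results become inconsistent after a SchemaChange modifies a Key column.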

  1. Lambda function memory leak
  1. regexp_replace memory leak

  1. Expression child number xxxx exceeded the maximum 10000

Expression child number xxxx exceeded the maximum 10000
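  • Affected versions:
    • All versions
  • Solution
    • Set expr_children_limit=50000 in fe.conf (the default is 10000; raising it risks degraded performance and excessive resource usage)
  • Root cause
    • The number of child Exprs is currently capped to prevent performance problems and excessive resource usage.
    • If you need to insert many rows with insert into values, use stream load / routine load / flink or another load method instead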
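If switching to a load method is not possible right away, one stopgap is to split a large row set into smaller statements on the client side. This is only a sketch: the post does not say exactly how child expressions are counted, so `max_rows_per_statement` is a hypothetical knob to tune by experiment, and `batched_rows` is an illustrative helper, not part of any StarRocks client.

```python
from itertools import islice


def batched_rows(rows, max_rows_per_statement):
    """Yield consecutive batches of at most `max_rows_per_statement` rows."""
    if max_rows_per_statement <= 0:
        raise ValueError("max_rows_per_statement must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, max_rows_per_statement))
        if not batch:
            return
        yield batch


# Each batch would become one smaller INSERT ... VALUES statement.
for batch in batched_rows(range(5), 2):
    print(batch)
```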
  1. native_queued_spin_lock_slowpath uses a lot of CPU

perf top shows native_queued_spin_lock_slowpath consuming a large share of CPU.

This is usually most severe on machines with many cores under high concurrency.

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ latest
  • Fixed versions:
    • 2.4+
  • Temporary workaround:
  • Root cause:
    • TCMalloc performs poorly on NUMA architectures; versions 2.4+ have switched to Jemalloc
  1. Join runtime filter merge crash

*** Aborted at 1686552759 (unix time) try "date -d @1686552759" if you are using GNU date ***
PC: @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
*** SIGSEGV (@0x207fa2c000) received by PID 40541 (TID 0x7f6fd50cb700) from PID 2141372416; stack trace: ***
    @          0x3f8c022 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f707104a630 (unknown)
    @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
    @          0x33eb37e starrocks::vectorized::RuntimeFilterHelper::fill_runtime_bloom_filter()
    @          0x2a3824a starrocks::pipeline::PartialRuntimeFilterMerger::merge_local_bloom_filters()
    @          0x2a349bf starrocks::pipeline::HashJoinBuildOperator::set_finishing()
    @          0x29df067 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x29dfc85 starrocks::pipeline::PipelineDriver::process()
    @          0x29d65be starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2235c39 starrocks::ThreadPool::dispatch_thread()
    @          0x22317ea starrocks::Thread::supervise_thread()
    @     0x7f7071042ea5 start_thread
    @     0x7f707065d9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ 2.2.14
    • 2.3.0 ~ 2.3.13
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0 ~ 3.0.2
  • Fixed versions:
    • 2.2.15+
    • 2.3.14+
    • 2.4.6+
    • 2.5.7+
    • 3.0.3+
  • Temporary workaround (affects performance):
    • set global enable_global_runtime_filter=false;
    • set global runtime_join_filter_push_down_limit=0;
  • Root cause:
    • The Join ON column (right table) is a string, and its total length exceeds 4 GB
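The 4 GB threshold matches the range of an unsigned 32-bit integer (2^32 bytes). One plausible reading, stated here as an assumption and not as a confirmed detail of the StarRocks implementation, is that byte offsets into the string data are held in 32 bits and wrap around once the total length passes that point. The sketch below only models that arithmetic; the function names are hypothetical.

```python
UINT32_RANGE = 2 ** 32  # number of distinct values a 32-bit offset can hold


def fits_in_uint32_offsets(lengths):
    """True if the cumulative byte length is still addressable in 32 bits."""
    return sum(lengths) < UINT32_RANGE


def wrapped_offset(total_bytes):
    """Value a 32-bit unsigned offset would hold after wrapping around."""
    return total_bytes % UINT32_RANGE


# Two 2 GiB string columns merged together no longer fit.
print(fits_in_uint32_offsets([2 ** 31, 2 ** 31]))
# An offset 16 bytes past the limit wraps to a small, wrong position.
print(wrapped_offset(2 ** 32 + 16))
```

Under this assumption, a wrapped offset points at the wrong bytes, which would be consistent with the SIGSEGV in RuntimeBloomFilter<>::insert() shown in the stack trace.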
  1. Partial update causes BE to crash on startup

BE crashes repeatedly while loading Tablets at startup.

*** SIGSEGV (@0x8) received by PID 244327 (TID 0x7facab9fe700) from PID 8; stack trace: ***
@         0x481e332 google::(anonymous namespace)::FailureSignalHandler()
@         0x7facdbc62630 (unknown)
@         0x24d2f7b starrocks::Rowset::do_load()
@         0x24d35cf starrocks::Rowset::load()
@         0x24d3966 starrocks::Rowset::get_segment_iterators2()
@         0x20334ec starrocks::RowsetUpdateState::_do_load()
@         0x2034f78 _ZZSt9call_onceIZN9starrocks17RowsetUpdateState4loadEPNS0_6TabletEPNS0_6RowsetEEUlvE_JEEvRSt9once_flagOT_DpOT0_ENUlvE0_4_FUNEv
@         0x7facdbc5920b __pthread_once_slow
@         0x202fd63 starrocks::RowsetUpdateState::load()
@         0x1e6da98 starrocks::TabletUpdates::_apply_rowset_commit()
@         0x1e73bb3 starrocks::TabletUpdates::do_apply()
@         0x2680af5 starrocks::ThreadPool::dispatch_thread()
@         0x267bf2a starrocks::supervise_thread()
@         0x7facdbc5aea5 start_thread
@         0x7facdb27596d __clone
@         0x0 (unknown)
  1. Crash when using expression-based automatic partition creation

For example: PARTITION BY date_trunc('day', dt)
*** Aborted at 1686563917 (unix time) try "date -d @1686563917" if you are using GNU date ***
PC: @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
*** SIGSEGV (@0x4f95) received by PID 725038 (TID 0x7f70fd996700) from PID 20373; stack trace: ***
    @          0x62de642 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f715ea9b370 (unknown)
    @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
    @          0x56f695e starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5710634 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4fd3928 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4fd5a2d starrocks::PlanFragmentExecutor::open()
    @          0x4f1f9bb starrocks::FragmentExecState::execute()
    @          0x4f261e3 starrocks::FragmentMgr::exec_actual()
    @          0x50ed9b2 starrocks::ThreadPool::dispatch_thread()
    @          0x50e84aa starrocks::Thread::supervise_thread()
    @     0x7f715ea93dc5 start_thread
    @     0x7f715e0b476d __clone
    @                0x0 (unknown)
  1. Librdkafka crashes when creating a thread

starrocks_be: rdkafka_broker.c:5702: rd_kafka_broker_add_logical: Assertion `rkb && *"failed to create broker thread"' failed.
*** Aborted at 1680075003 (unix time) try "date -d @1680075003" if you are using GNU date ***
PC: @ 0x7f8d378f4207 __GI_raise
*** SIGABRT (@0x7d10000a58e) received by PID 42382 (TID 0x7f8c6ff86700) from PID 42382; stack trace: ***
 @ 0x354c222 google::(anonymous namespace)::FailureSignalHandler()
 @ 0x7f8d385be5d0 (unknown)
 @ 0x7f8d378f4207 __GI_raise
 @ 0x7f8d378f58f8 __GI_abort
 @ 0x7f8d378ed026 __assert_fail_base
 @ 0x7f8d378ed0d2 __GI___assert_fail
 @ 0x4713d5e rd_kafka_broker_add_logical
 @ 0x475a2ea rd_kafka_cgrp_new
 @ 0x46fcfaf rd_kafka_new
 @ 0x46e78ff RdKafka::KafkaConsumer::create()
 @ 0x1cfdd14 starrocks::KafkaDataConsumer::init()
 @ 0x1ca19ce starrocks::DataConsumerPool::get_consumer()
 @ 0x2ec7d1a starrocks::RoutineLoadTaskExecutor::get_kafka_partition_offset()
 @ 0x1d16075 starrocks::PInternalServiceImpl<>::get_info()
 @ 0x36d7cee brpc::policy::ProcessRpcRequest()
 @ 0x36ce757 brpc::ProcessInputMessage()
 @ 0x36cf603 brpc::InputMessenger::OnNewMessages()
 @ 0x377634e brpc::Socket::ProcessEvent()
 @ 0x368425f bthread::TaskGroup::task_runner()
 @ 0x380cc11 bthread_make_fcontext
  • Root cause:
    • The thread count limit has been reached; run ulimit -u to check the current limit
  • Solution
    • Raise the thread count limit and restart BE
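The same limit that ulimit -u reports can be read programmatically, together with the number of threads a process currently has. This is a minimal Linux-only sketch: it relies on /proc/<pid>/task, and the helper names are hypothetical. Run it as the user that owns the BE process and pass the BE pid to `thread_count`.

```python
import os
import resource


def nproc_limit():
    """Return the (soft, hard) RLIMIT_NPROC limit, the value behind `ulimit -u`."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def thread_count(pid):
    """Count the threads of a process; each entry in /proc/<pid>/task is one thread."""
    return len(os.listdir(f"/proc/{pid}/task"))


soft, hard = nproc_limit()
# resource.RLIM_INFINITY means the limit is unlimited.
print("nproc soft/hard limit:", soft, hard)
print("threads in this process:", thread_count(os.getpid()))
```

Note that the limit applies per user, so threads from every process owned by that user count toward it, not only those of BE.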
  1. JDBC external table query crash

*** Aborted at 1675922674 (unix time) try "date -d @1675922674" if you are using GNU date ***
PC: @     0x7f9632f0d465 __memcpy_ssse3
*** SIGSEGV (@0x7f918d6fe000) received by PID 30379 (TID 0x7f95579e4700) from PID 18446744071787503616; stack trace: ***
    @          0x56ec9c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f96343c608b os::Linux::chained_handler()
    @     0x7f96343caacd JVM_handle_linux_signal
    @     0x7f96343bdcd8 signalHandler()
    @     0x7f96338a9630 (unknown)
    @     0x7f9632f0d465 __memcpy_ssse3
    @          0x4cfd328 starrocks::stream_load::OlapTableSink::_print_varchar_error_msg()
    @          0x4cffc09 starrocks::stream_load::OlapTableSink::_validate_data()
    @          0x4d0c093 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4d7def9 starrocks::pipeline::OlapTableSinkOperator::push_chunk()
    @          0x2c39826 starrocks::pipeline::PipelineDriver::process()
    @          0x4d8cff7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x47a3a9d starrocks::ThreadPool::dispatch_thread()
    @          0x479e82a starrocks::Thread::supervise_thread()
    @     0x7f96338a1ea5 start_thread
    @     0x7f9632ebc9fd __clone
    @                0x0 (unknown)
  1. HTTP header is larger than 8192 bytes

[HttpServerHandler.channelRead():70] accept bad request: /api/test/f_l_c_eutrancelltdd_q/_stream_load, error: HTTP header is larger than 8192 bytes.
fe.warn.log:458:com.starrocks.http.HttpRequestException: HTTP header is larger than 8192 bytes
  • Solution
    • Increase http_max_header_size in fe.conf. The default is 8192 in 3.0 and earlier versions, and 32768 in 3.1+
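Before changing the setting, it can help to estimate how large the request headers actually are, for example when a stream load passes a long column list in a header. This is a rough sketch: `http_header_bytes` is a hypothetical helper, the header name `columns` and the sample values are illustrative, and exactly which bytes the server counts toward the limit is an assumption.

```python
def http_header_bytes(request_line, headers):
    """Approximate the size of the request line plus headers in bytes."""
    size = len(request_line.encode("utf-8")) + 2          # request line + CRLF
    for name, value in headers.items():
        size += len(f"{name}: {value}".encode("utf-8")) + 2  # header + CRLF
    return size + 2                                        # blank line ending headers


# A wide table with 2000 columns listed in one header value.
columns = ",".join(f"col_{i}" for i in range(2000))
size = http_header_bytes(
    "PUT /api/db/tbl/_stream_load HTTP/1.1",
    {"Expect": "100-continue", "columns": columns},
)
print(size, size > 8192)
```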
  1. UDF error: Download file’s checksum is not match

  • Root cause:
    • The UDF JAR was updated, but the UDF function was not recreated
  • Solution:
    • Recreate the UDF function
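To confirm that a JAR really changed after the function was created, the checksum of the currently published file can be compared with one recorded earlier. This is a minimal sketch that assumes the checksum is an MD5 of the file contents, which the post does not state; `file_md5` and `jar_changed` are hypothetical helpers.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_changed(recorded_checksum, jar_path):
    """True if the JAR on disk no longer matches the recorded checksum."""
    return file_md5(jar_path) != recorded_checksum
```

If `jar_changed` returns True for the JAR the function points at, recreating the UDF function as described above is the fix.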
  1. Hive catalog / external table: wrong query results when the query has an OR condition

  1. Local Shuffle causes wrong query results

  1. Automatic partition creation crash
*** Aborted at 1689211640 (unix time) try "date -d @1689211640" if you are using GNU date ***
PC: @     0x7f41521e3e70 __memcmp_avx2_movbe
*** SIGSEGV (@0x7f41ce9efb89) received by PID 2149167 (TID 0x7f40ee501700) from PID 18446744072881109897; stack trace: ***
    @          0x61d1c02 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f4152befc20 (unknown)
    @     0x7f41521e3e70 __memcmp_avx2_movbe
    @          0x307c958 starrocks::BinaryColumnBase<>::compare_at()
    @          0x55efcfe starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5609a14 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4f037f8 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4f058fd starrocks::PlanFragmentExecutor::open()
    @          0x4e5012b starrocks::FragmentExecState::execute()
    @          0x4e56903 starrocks::FragmentMgr::exec_actual()
    @          0x501f4b2 starrocks::ThreadPool::dispatch_thread()
    @          0x5019faa starrocks::Thread::supervise_thread()
    @     0x7f4152be517a start_thread
    @     0x7f4152186dc3 __GI___clone
    @                0x0 (unknown)
  1. BE crash caused by a load cancel or timeout

*** Aborted at 1689139416 (unix time) try "date -d @1689139416" if you are using GNU date ***
PC: @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
*** SIGSEGV (@0x2a0) received by PID 59421 (TID 0x7fa2a17c9700) from PID 672; stack trace: ***
    @          0x487d742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa32f8cb630 (unknown)
    @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
    @          0x3c7abab starrocks::stream_load::OlapTableSink::cancel()
    @          0x36cce82 starrocks::PlanFragmentExecutor::cancel()
    @          0x363ce47 starrocks::FragmentMgr::cancel()
    @          0x363df7a starrocks::FragmentMgr::cancel_worker()
    @          0x6372d20 execute_native_thread_routine
    @     0x7fa32f8c3ea5 start_thread
    @     0x7fa32eede9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
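  • Affected versions:
    • 2.3.7 ~ 2.3.9
    • 2.4.3 ~ 2.4.4
  • Fixed versions:
    • 2.3.10+
    • 2.4.5+
  • Temporary workaround:
  • Root cause:
    • Loads run on the non-Pipeline engine, whose fast-cancel logic is buggy (in 2.3 and 2.4, loads use the non-Pipeline engine)
  1. BE disk usage grows abnormally and recovers after a restart
  1. Materialized view query rewrite error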

2023-02-21 06:15:02,865 WARN (starrocks-mysql-nio-pool-462|16527) [StmtExecutor.execute():522] execute Exception, sql select * from xxx limit 11
java.lang.NullPointerException: null
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForPartitionedMv(MaterializedView.java:820) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:808) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getUpdatedPartitionNamesOfTable(MaterializedView.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:798) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.MvRewritePreprocessor.prepareMvCandidatesForPlan(MvRewritePreprocessor.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.prepare(Optimizer.java:193) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:88) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:37) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:373) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:313) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:430) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:676) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]