Common Crash / Bug / Optimization Lookup

  1. native_queued_spin_lock_slowpath uses a lot of CPU

perf top shows native_queued_spin_lock_slowpath consuming a large share of CPU.

This is usually most severe on machines with many cores under high concurrency.

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ latest
  • Fixed versions:
    • 2.4+
  • Temporary workaround:
  • Root cause:
    • TCMalloc performs poorly on NUMA architectures; versions 2.4+ have switched to jemalloc.
  1. Join runtime filter merge crash

*** Aborted at 1686552759 (unix time) try "date -d @1686552759" if you are using GNU date ***
PC: @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
*** SIGSEGV (@0x207fa2c000) received by PID 40541 (TID 0x7f6fd50cb700) from PID 2141372416; stack trace: ***
    @          0x3f8c022 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f707104a630 (unknown)
    @          0x33f6c80 starrocks::vectorized::RuntimeBloomFilter<>::insert()
    @          0x33eb37e starrocks::vectorized::RuntimeFilterHelper::fill_runtime_bloom_filter()
    @          0x2a3824a starrocks::pipeline::PartialRuntimeFilterMerger::merge_local_bloom_filters()
    @          0x2a349bf starrocks::pipeline::HashJoinBuildOperator::set_finishing()
    @          0x29df067 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x29dfc85 starrocks::pipeline::PipelineDriver::process()
    @          0x29d65be starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2235c39 starrocks::ThreadPool::dispatch_thread()
    @          0x22317ea starrocks::Thread::supervise_thread()
    @     0x7f7071042ea5 start_thread
    @     0x7f707065d9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ 2.2.14
    • 2.3.0 ~ 2.3.13
    • 2.4.0 ~ 2.4.5
    • 2.5.0 ~ 2.5.6
    • 3.0.0 ~ 3.0.2
  • Fixed versions:
    • 2.2.15+
    • 2.3.14+
    • 2.4.6+
    • 2.5.7+
    • 3.0.3+
  • Temporary workaround (affects performance):
    • set global enable_global_runtime_filter=false;
    • set global runtime_join_filter_push_down_limit=0;
  • Root cause:
    • The join ON column (on the right table) is a string column, and the total length of its values exceeds 4 GB.
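The 4 GB threshold is the point at which a byte offset no longer fits in an unsigned 32-bit integer. The sketch below is illustrative only: it assumes, without confirmation from the crash report, that string offsets are kept in 32 bits, and shows how an offset past 2^32 would wrap around to a small value. The helper names `wrap_u32`, `total_key_bytes`, and `fits_in_u32` are hypothetical.

```python
U32_LIMIT = 2 ** 32  # 4 GiB: the first value that does not fit in an unsigned 32-bit integer


def wrap_u32(offset: int) -> int:
    """Return the value an unsigned 32-bit field would hold for this offset."""
    return offset & (U32_LIMIT - 1)


def total_key_bytes(keys) -> int:
    """Sum the UTF-8 byte lengths of the join-key strings."""
    return sum(len(k.encode("utf-8")) for k in keys)


def fits_in_u32(total_bytes: int) -> bool:
    """True while the combined key length is still representable in 32 bits."""
    return total_bytes < U32_LIMIT
```

Running `total_key_bytes` over a sample of the right-table join keys and scaling by row count gives a rough idea of whether a query is near the threshold.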
  1. Partial update causes BE to crash on startup

BE crashes repeatedly while loading tablets at startup.

*** SIGSEGV (@0x8) received by PID 244327 (TID 0x7facab9fe700) from PID 8; stack trace: ***
@         0x481e332 google::(anonymous namespace)::FailureSignalHandler()
@         0x7facdbc62630 (unknown)
@         0x24d2f7b starrocks::Rowset::do_load()
@         0x24d35cf starrocks::Rowset::load()
@         0x24d3966 starrocks::Rowset::get_segment_iterators2()
@         0x20334ec starrocks::RowsetUpdateState::_do_load()
@         0x2034f78 _ZZSt9call_onceIZN9starrocks17RowsetUpdateState4loadEPNS0_6TabletEPNS0_6RowsetEEUlvE_JEEvRSt9once_flagOT_DpOT0_ENUlvE0_4_FUNEv
@         0x7facdbc5920b __pthread_once_slow
@         0x202fd63 starrocks::RowsetUpdateState::load()
@         0x1e6da98 starrocks::TabletUpdates::_apply_rowset_commit()
@         0x1e73bb3 starrocks::TabletUpdates::do_apply()
@         0x2680af5 starrocks::ThreadPool::dispatch_thread()
@         0x267bf2a starrocks::supervise_thread()
@         0x7facdbc5aea5 start_thread
@         0x7facdb27596d __clone
@         0x0 (unknown)
  1. Crash when using expression-based automatic partition creation

For example: PARTITION BY date_trunc('day', dt)
*** Aborted at 1686563917 (unix time) try "date -d @1686563917" if you are using GNU date ***
PC: @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
*** SIGSEGV (@0x4f95) received by PID 725038 (TID 0x7f70fd996700) from PID 20373; stack trace: ***
    @          0x62de642 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f715ea9b370 (unknown)
    @          0x30d5585 starrocks::BinaryColumnBase<>::compare_at()
    @          0x56f695e starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5710634 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4fd3928 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4fd5a2d starrocks::PlanFragmentExecutor::open()
    @          0x4f1f9bb starrocks::FragmentExecState::execute()
    @          0x4f261e3 starrocks::FragmentMgr::exec_actual()
    @          0x50ed9b2 starrocks::ThreadPool::dispatch_thread()
    @          0x50e84aa starrocks::Thread::supervise_thread()
    @     0x7f715ea93dc5 start_thread
    @     0x7f715e0b476d __clone
    @                0x0 (unknown)
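Crash headers like the one above carry a Unix timestamp, the faulting frame (the PC line), and the signal. When matching a new crash against the entries in this list, it can help to pull those fields out mechanically. The following is a minimal sketch; `parse_crash_header` is a hypothetical helper, and the regular expressions only assume the header layout seen in the traces quoted in this post.

```python
import re
from datetime import datetime, timezone

ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
PC_RE = re.compile(r"^PC: @\s+0x[0-9a-f]+ (.+)$", re.MULTILINE)
SIGNAL_RE = re.compile(r"\*\*\* (SIG[A-Z]+) \(@0x[0-9a-f]+\) received by PID (\d+)")


def parse_crash_header(text: str) -> dict:
    """Extract crash time (UTC), faulting symbol, signal and PID when present."""
    result = {}
    m = ABORT_RE.search(text)
    if m:
        # Same conversion as `date -d @<seconds>`, but always in UTC.
        result["time_utc"] = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    m = PC_RE.search(text)
    if m:
        result["pc_symbol"] = m.group(1).strip()
    m = SIGNAL_RE.search(text)
    if m:
        result["signal"] = m.group(1)
        result["pid"] = int(m.group(2))
    return result
```

The `pc_symbol` value is usually the quickest key for searching this list, since the addresses differ between builds while the symbol names stay the same.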
  1. librdkafka crashes when creating a thread

starrocks_be: rdkafka_broker.c:5702: rd_kafka_broker_add_logical: Assertion `rkb && *"failed to create broker thread"' failed.
*** Aborted at 1680075003 (unix time) try "date -d @1680075003" if you are using GNU date ***
PC: @ 0x7f8d378f4207 __GI_raise
*** SIGABRT (@0x7d10000a58e) received by PID 42382 (TID 0x7f8c6ff86700) from PID 42382; stack trace: ***
 @ 0x354c222 google::(anonymous namespace)::FailureSignalHandler()
 @ 0x7f8d385be5d0 (unknown)
 @ 0x7f8d378f4207 __GI_raise
 @ 0x7f8d378f58f8 __GI_abort
 @ 0x7f8d378ed026 __assert_fail_base
 @ 0x7f8d378ed0d2 __GI___assert_fail
 @ 0x4713d5e rd_kafka_broker_add_logical
 @ 0x475a2ea rd_kafka_cgrp_new
 @ 0x46fcfaf rd_kafka_new
 @ 0x46e78ff RdKafka::KafkaConsumer::create()
 @ 0x1cfdd14 starrocks::KafkaDataConsumer::init()
 @ 0x1ca19ce starrocks::DataConsumerPool::get_consumer()
 @ 0x2ec7d1a starrocks::RoutineLoadTaskExecutor::get_kafka_partition_offset()
 @ 0x1d16075 starrocks::PInternalServiceImpl<>::get_info()
 @ 0x36d7cee brpc::policy::ProcessRpcRequest()
 @ 0x36ce757 brpc::ProcessInputMessage()
 @ 0x36cf603 brpc::InputMessenger::OnNewMessages()
 @ 0x377634e brpc::Socket::ProcessEvent()
 @ 0x368425f bthread::TaskGroup::task_runner()
 @ 0x380cc11 bthread_make_fcontext
  • Root cause:
    • The thread count limit has been reached. Run ulimit -u to check the current limit.
  • Solution:
    • Raise the thread count limit and restart BE.
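To see how close a process is to the limit, the value reported by `ulimit -u` (RLIMIT_NPROC) can be compared with the thread count the kernel reports for the BE process. A minimal, Linux-only sketch; the function names are hypothetical, and the PID of the BE process has to be supplied by the caller.

```python
import resource


def nproc_limit():
    """Return the (soft, hard) RLIMIT_NPROC pair, the limit `ulimit -u` reports."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def thread_count_from_status(status_text: str) -> int:
    """Extract the `Threads:` value from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: line found")


def read_thread_count(pid: int) -> int:
    """Read the current thread count of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return thread_count_from_status(f.read())
```

Note that RLIMIT_NPROC is accounted per user, so threads of other processes running under the same account also count toward it. A limit of `resource.RLIM_INFINITY` means no limit is set.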
  1. JDBC external table query crash

*** Aborted at 1675922674 (unix time) try "date -d @1675922674" if you are using GNU date ***
PC: @     0x7f9632f0d465 __memcpy_ssse3
*** SIGSEGV (@0x7f918d6fe000) received by PID 30379 (TID 0x7f95579e4700) from PID 18446744071787503616; stack trace: ***
    @          0x56ec9c2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f96343c608b os::Linux::chained_handler()
    @     0x7f96343caacd JVM_handle_linux_signal
    @     0x7f96343bdcd8 signalHandler()
    @     0x7f96338a9630 (unknown)
    @     0x7f9632f0d465 __memcpy_ssse3
    @          0x4cfd328 starrocks::stream_load::OlapTableSink::_print_varchar_error_msg()
    @          0x4cffc09 starrocks::stream_load::OlapTableSink::_validate_data()
    @          0x4d0c093 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4d7def9 starrocks::pipeline::OlapTableSinkOperator::push_chunk()
    @          0x2c39826 starrocks::pipeline::PipelineDriver::process()
    @          0x4d8cff7 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x47a3a9d starrocks::ThreadPool::dispatch_thread()
    @          0x479e82a starrocks::Thread::supervise_thread()
    @     0x7f96338a1ea5 start_thread
    @     0x7f9632ebc9fd __clone
    @                0x0 (unknown)
  1. HTTP header is larger than 8192 bytes

[HttpServerHandler.channelRead():70] accept bad request: /api/test/f_l_c_eutrancelltdd_q/_stream_load, error: HTTP header is larger than 8192 bytes.
fe.warn.log:458:com.starrocks.http.HttpRequestException: HTTP header is larger than 8192 bytes
  • Resolution:
    • Increase http_max_header_size in fe.conf. The default is 8192 in version 3.0 and earlier, and 32768 in 3.1+.
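Before raising the limit, it can be useful to estimate how large the request headers actually are. The sketch below approximates the header size of a request from its method, path, and header map. It is an approximation under stated assumptions: the server may count bytes slightly differently, the path and header names in the usage note are hypothetical, and the idea that a long `columns` header is what pushes a Stream Load request over the limit is a guess, not something the log confirms.

```python
def http_header_bytes(method: str, path: str, headers: dict) -> int:
    """Approximate the size in bytes of the request line plus all headers."""
    request_line = f"{method} {path} HTTP/1.1\r\n"
    header_lines = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    # The final CRLF is the blank line that ends the header section.
    return len((request_line + header_lines + "\r\n").encode("utf-8"))


def exceeds_limit(method: str, path: str, headers: dict, limit: int = 8192) -> bool:
    """True when the estimated header size is above the configured limit."""
    return http_header_bytes(method, path, headers) > limit
```

For example, `exceeds_limit("PUT", "/api/db/tbl/_stream_load", {"columns": very_long_column_list})` indicates whether a given column mapping is likely to hit the 8192-byte default.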
  1. UDF error: Download file’s checksum is not match

  • Root cause:
    • The UDF JAR was updated, but the UDF function was not re-created.
  • Resolution:
    • Re-create the UDF function.
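To confirm that a JAR has changed since the function was created, its current checksum can be compared with the one recorded at creation time. A minimal sketch, assuming the recorded checksum is an MD5 hex digest (this is an assumption; check what your version actually records). All function names here are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(current_hex: str, recorded_hex: str) -> bool:
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return current_hex.strip().lower() == recorded_hex.strip().lower()
```

If `checksum_matches(file_md5("my_udf.jar"), recorded)` is false, the JAR no longer matches what the function was created with, and re-creating the function is the fix described above.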
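  1. Hive catalog / external table: incorrect query results when the query has an OR condition

  1. Local shuffle causes incorrect query results

  1. Automatic partition creation crash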
*** Aborted at 1689211640 (unix time) try "date -d @1689211640" if you are using GNU date ***
PC: @     0x7f41521e3e70 __memcmp_avx2_movbe
*** SIGSEGV (@0x7f41ce9efb89) received by PID 2149167 (TID 0x7f40ee501700) from PID 18446744072881109897; stack trace: ***
    @          0x61d1c02 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f4152befc20 (unknown)
    @     0x7f41521e3e70 __memcmp_avx2_movbe
    @          0x307c958 starrocks::BinaryColumnBase<>::compare_at()
    @          0x55efcfe starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5609a14 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4f037f8 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4f058fd starrocks::PlanFragmentExecutor::open()
    @          0x4e5012b starrocks::FragmentExecState::execute()
    @          0x4e56903 starrocks::FragmentMgr::exec_actual()
    @          0x501f4b2 starrocks::ThreadPool::dispatch_thread()
    @          0x5019faa starrocks::Thread::supervise_thread()
    @     0x7f4152be517a start_thread
    @     0x7f4152186dc3 __GI___clone
    @                0x0 (unknown)
  1. BE crash caused by a load being cancelled or timing out

*** Aborted at 1689139416 (unix time) try "date -d @1689139416" if you are using GNU date ***
PC: @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
*** SIGSEGV (@0x2a0) received by PID 59421 (TID 0x7fa2a17c9700) from PID 672; stack trace: ***
    @          0x487d742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa32f8cb630 (unknown)
    @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
    @          0x3c7abab starrocks::stream_load::OlapTableSink::cancel()
    @          0x36cce82 starrocks::PlanFragmentExecutor::cancel()
    @          0x363ce47 starrocks::FragmentMgr::cancel()
    @          0x363df7a starrocks::FragmentMgr::cancel_worker()
    @          0x6372d20 execute_native_thread_routine
    @     0x7fa32f8c3ea5 start_thread
    @     0x7fa32eede9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
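  • Affected versions:
    • 2.3.7 ~ 2.3.9
    • 2.4.3 ~ 2.4.4
  • Fixed versions:
    • 2.3.10+
    • 2.4.5+
  • Temporary workaround:
  • Root cause:
    • Loads run on the non-pipeline engine, and its fast-cancel logic is buggy (in 2.3 and 2.4, loads use the non-pipeline engine).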
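Checking whether a running version falls inside ranges like the ones listed above is easy to get wrong with plain string comparison, because "2.3.10" sorts before "2.3.9" as text. A minimal sketch that compares versions numerically; the helper names are hypothetical, and the ranges are copied from this entry.

```python
def parse_version(text: str) -> tuple:
    """Turn '2.3.10' into (2, 3, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


# Affected ranges for this entry, inclusive on both ends.
AFFECTED = [("2.3.7", "2.3.9"), ("2.4.3", "2.4.4")]


def is_affected(version: str, ranges=AFFECTED) -> bool:
    """True when the version lies inside any of the inclusive ranges."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi) for lo, hi in ranges)
```

The same function works for the other entries in this list by passing their ranges as the second argument. Open-ended ranges such as "2.2.0 ~ latest" would need separate handling.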
  1. BE disk usage grows abnormally and recovers after a restart
  1. Materialized view query rewrite error

2023-02-21 06:15:02,865 WARN (starrocks-mysql-nio-pool-462|16527) [StmtExecutor.execute():522] execute Exception, sql select * from xxx limit 11
java.lang.NullPointerException: null
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForPartitionedMv(MaterializedView.java:820) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:808) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getUpdatedPartitionNamesOfTable(MaterializedView.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:798) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.MvRewritePreprocessor.prepareMvCandidatesForPlan(MvRewritePreprocessor.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.prepare(Optimizer.java:193) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:88) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:37) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:373) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:313) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:430) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:676) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]

  1. failed to call frontend service

BE error:

failed to call frontend service

FE error:

2023-07-23 06:49:00,883 WARN (thrift-server-accept|85) [ThreadPoolManager$LogDiscardPolicy.rejectedExecution():178] Task com.starrocks.common.SRTThreadPoolServer$WorkerProcess@5362a7df rejected from thrift-server-pool java.util.concurrent.ThreadPoolExecutor@6a6406a6[Running, pool size = 4096, active threads = 4096, queued tasks = 0, completed tasks = 4558444]
  • Github Issue:
  • Github Fix PR:
  • Jira
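  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ latest
    • 2.4.0 ~ latest
  • Fixed versions:
    • 2.5.0+
  • Temporary workaround:
    • Edit fe.conf: thrift_server_max_worker_threads=8192 (default is 4096)
    • Lower the session variable parallel_fragment_exec_instance_num
  • Root cause:
    • A Thrift thread pool issue; 2.5 specifically optimized this.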
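The FE warning above already contains the evidence: the pool size equals the number of active threads, so new tasks are rejected. A small sketch for pulling those numbers out of fe.warn.log lines; `parse_pool_rejection` and `is_saturated` are hypothetical helpers, and the pattern only assumes the log format shown in this entry.

```python
import re

POOL_RE = re.compile(
    r"rejected from (\S+) .*?\[(\w+), pool size = (\d+), active threads = (\d+), "
    r"queued tasks = (\d+), completed tasks = (\d+)\]"
)


def parse_pool_rejection(line: str):
    """Return the pool statistics from a rejection log line, or None if absent."""
    m = POOL_RE.search(line)
    if m is None:
        return None
    return {
        "pool": m.group(1),
        "state": m.group(2),
        "pool_size": int(m.group(3)),
        "active_threads": int(m.group(4)),
        "queued_tasks": int(m.group(5)),
        "completed_tasks": int(m.group(6)),
    }


def is_saturated(stats: dict) -> bool:
    """True when every thread in the pool is busy."""
    return stats["active_threads"] >= stats["pool_size"]
```

Counting how often saturated lines appear over time gives a rough signal for whether raising thrift_server_max_worker_threads was enough.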
  1. Crash when using the replace function

*** Aborted at 1688752969 (unix time) try "date -d @1688752969" if you are using GNU date ***
PC: @     0x2baa64381387 __GI_raise
*** SIGABRT (@0x1229c) received by PID 74396 (TID 0x2bac0a06c700) from PID 74396; stack trace: ***
    @          0x596c182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2baa63a30630 (unknown)
    @     0x2baa64381387 __GI_raise
    @     0x2baa64382a78 __GI_abort
    @          0x2af5006 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x7dfef76 __cxxabiv1::__terminate()
    @          0x7dfefe1 std::terminate()
    @          0x7dff134 __cxa_throw
    @          0x2af6ce7 std::__throw_length_error()
    @          0x507c233 starrocks::vectorized::regexp_replace_use_hyperscan()
    @          0x50855d6 starrocks::vectorized::StringFunctions::regexp_replace()
    @          0x3e843c7 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x386b7c7 starrocks::ExprContext::evaluate()
    @          0x3e83f9c starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x386b7c7 starrocks::ExprContext::evaluate()
    @          0x3e83f9c starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x3e4e60f starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x3e4e646 starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x3e4e646 starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x386b85e starrocks::ExprContext::evaluate()
    @          0x2ebde44 starrocks::vectorized::ProjectNode::get_next()
    @          0x4891463 starrocks::PlanFragmentExecutor::_get_next_internal_vectorized()
    @          0x4891850 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x48939ed starrocks::PlanFragmentExecutor::open()
    @          0x47e4f4b starrocks::FragmentExecState::execute()
    @          0x47eb1d3 starrocks::FragmentMgr::exec_actual()
    @          0x49888b2 starrocks::ThreadPool::dispatch_thread()
    @          0x49833aa starrocks::Thread::supervise_thread()
    @     0x2baa63a28ea5 start_thread
    @     0x2baa6444996d __clone
    @                0x0 (unknown)
  1. Thrift RPC allocates a large amount of memory

W0726 14:07:03.697324 16174 mem_hook.cpp:247] large memory alloc: 1347571780 bytes, stack:
    @          0x31a4d83  malloc
    @          0x8191535  operator new()
    @          0x27b6d7e  std::__cxx11::basic_string<>::_M_mutate()
    @          0x30af6b7  apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
    @          0x30af84c  apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
    @          0x3318599  apache::thrift::TDispatchProcessor::process()
    @          0x5f0a058  apache::thrift::server::TConnectedClient::run()
    @          0x5f02554  apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
    @          0x5f04d5d  apache::thrift::concurrency::Thread::threadMain()
    @          0x5eea4c6  std::thread::_State_impl<>::_M_run()
    @          0x820a430  execute_native_thread_routine
    @     0x7f77e8ebeea5  start_thread
    @     0x7f77e84d9b0d  __clone
    @              (nil)  (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ 2.3.14
    • 2.4.0 ~ latest
    • 2.5.0 ~ 2.5.9
    • 3.0.0 ~ 3.0.4
  • Fixed versions:
    • 2.2: not fixed
    • 2.3.15+
    • 2.4: not fixed
    • 2.5.10+
    • 3.0.5+
  • Temporary workaround:
  • Root cause:
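The source does not state the root cause. As a generic diagnostic heuristic, an implausibly large allocation made while reading a length-prefixed string can be checked for printable ASCII: if the size decodes to text, bytes that were not a length field may have been interpreted as one. The sketch below does that check; the helper names are hypothetical, and trying `size - 1` as well rests on the assumption that the string buffer includes one extra byte for a terminator.

```python
def as_ascii(value: int):
    """Return the 4-byte big-endian ASCII text of value, or None if not printable."""
    if not 0 <= value < 2 ** 32:
        return None
    raw = value.to_bytes(4, "big")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


def ascii_candidates(alloc_size: int) -> dict:
    """Decode both the size itself and size - 1, keeping the printable results."""
    out = {}
    for candidate in (alloc_size, alloc_size - 1):
        text = as_ascii(candidate)
        if text is not None:
            out[candidate] = text
    return out
```

For the size in the log above, 1347571780 decodes to "PRPD" and 1347571779 to "PRPC". "PRPC" happens to match the magic bytes at the start of brpc's baidu_std protocol, which may hint at non-Thrift traffic reaching a Thrift port; this is speculation, not a confirmed cause.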
  1. Duplicate RPC causes use-after-free

*** Aborted at 1689305620 (unix time) try "date -d @1689305620" if you are using GNU date ***
PC: @     0x7f548e0ace1f (unknown)
*** SIGABRT (@0x1300e) received by PID 77838 (TID 0x7f53e3e026c0) from PID 77838; stack trace: ***
    @          0x6240182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f548e060fc0 (unknown)
    @     0x7f548e0ace1f (unknown)
    @     0x7f548e060f16 gsignal
    @     0x7f548e04c47f abort
    @          0x2e62ca8 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x89c1436 __cxxabiv1::__terminate()
    @          0x89c14a1 std::terminate()
    @          0x89c1b9f __cxa_pure_virtual
    @          0x56ff27b starrocks::pipeline::PipelineDriverPoller::run_internal()
    @          0x5065b1a starrocks::Thread::supervise_thread()
    @     0x7f548e0ab32a (unknown)
    @     0x7f548e129a60 (unknown)
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.3.0 ~ latest
    • 2.4.0 ~ latest
    • 2.5.0 ~ 2.5.9
    • 3.0.0 ~ 3.0.4
  • Fixed versions:
    • 2.3: not fixed
    • 2.4: not fixed
    • 2.5.10+
    • 3.0.5+
  • Temporary workaround:
  • Root cause:
  1. FE metadata directory bloat

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.0.4 ~ 2.0.7
    • 2.1.5 ~ 2.1.10
    • 2.2.0 ~ 2.2.2
  • Fixed versions:
    • 2.0.8+
    • 2.1.11+
    • 2.2.3+
  • Temporary workaround:
    • Replace starrocks-bdb-je-7.3.8.jar in the fe/lib directory with http://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/je-7.3.7.jar and restart FE.
  • Root cause:
  1. Primary Key table compaction crash

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1667573352 (unix time) try "date -d @1667573352" if you are using GNU date ***
PC: @          0x1e72eb0 starrocks::TabletUpdates::_apply_compaction_commit()
*** SIGSEGV (@0x0) received by PID 40683 (TID 0x7efd5f069700) from PID 0; stack trace: ***
    @          0x4820332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7efddbe1e630 (unknown)
    @          0x1e72eb0 starrocks::TabletUpdates::_apply_compaction_commit()
    @          0x1e7425d starrocks::TabletUpdates::do_apply()
    @          0x2681635 starrocks::ThreadPool::dispatch_thread()
    @          0x267ca6a starrocks::Thread::supervise_thread()
    @     0x7efddbe16ea5 start_thread
    @     0x7efddb431b0d __clone
    @                0x0 (unknown)