Common Crash / Bug / Optimization Lookup

  1. HTTP header is larger than 8192 bytes

[HttpServerHandler.channelRead():70] accept bad request: /api/test/f_l_c_eutrancelltdd_q/_stream_load, error: HTTP header is larger than 8192 bytes.
fe.warn.log:458:com.starrocks.http.HttpRequestException: HTTP header is larger than 8192 bytes
  • Fix:
    • Increase http_max_header_size in fe.conf. The default is 8192 in 3.0 and earlier versions, and 32768 in 3.1+.
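
Before changing fe.conf, it can help to estimate on the client side whether a Stream Load request is likely to exceed the limit. The following is a rough sketch, not a description of how the FE counts bytes internally: it assumes each header costs its name, its value, and four extra bytes for the ": " separator and the trailing CRLF. The helper names and the sample headers are made up for illustration.

```python
def estimate_header_bytes(headers: dict) -> int:
    """Approximate the size of an HTTP header block in bytes.

    Assumption: each header line is "<name>: <value>" followed by CRLF,
    so it costs len(name) + len(value) + 4 bytes. The server's exact
    accounting may differ.
    """
    total = 0
    for name, value in headers.items():
        total += len(name.encode("utf-8")) + len(value.encode("utf-8")) + 4
    return total


def exceeds_limit(headers: dict, limit: int = 8192) -> bool:
    """Return True if the estimated header block is larger than `limit`."""
    return estimate_header_bytes(headers) > limit


if __name__ == "__main__":
    # Hypothetical Stream Load headers with a very long column mapping.
    headers = {
        "label": "demo_label",
        "column_separator": ",",
        "columns": ",".join(f"col_{i}" for i in range(1500)),
    }
    print("estimated header size:", estimate_header_bytes(headers), "bytes")
    print("over 8192:", exceeds_limit(headers, 8192))
    print("over 32768:", exceeds_limit(headers, 32768))
```

A long `columns` mapping is a typical way for a load request to grow past a small header limit, which is why the example uses one.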
  1. UDF error: Download file’s checksum is not match

  • Root cause:
    • The UDF JAR was updated, but the UDF function was not re-created.
  • Fix:
    • Re-create the UDF function.
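
To confirm that the JAR really changed, one can compare the digest of the JAR used when the function was created with the digest of the JAR served now. A minimal sketch, assuming the checksum is an MD5 digest of the file contents (this is an assumption, not something stated above); the function names and paths are hypothetical.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_changed(old_jar: str, new_jar: str) -> bool:
    """True if the two JAR files differ, i.e. the UDF should be re-created."""
    return file_md5(old_jar) != file_md5(new_jar)
```

If `jar_changed` returns True for the original and the currently deployed JAR, re-creating the function is the fix described above.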
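  1. Hive catalog / external table: wrong query results when the filter contains an OR condition

  1. Local Shuffle causes wrong query results

  1. Crash during automatic partition creation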
*** Aborted at 1689211640 (unix time) try "date -d @1689211640" if you are using GNU date ***
PC: @     0x7f41521e3e70 __memcmp_avx2_movbe
*** SIGSEGV (@0x7f41ce9efb89) received by PID 2149167 (TID 0x7f40ee501700) from PID 18446744072881109897; stack trace: ***
    @          0x61d1c02 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f4152befc20 (unknown)
    @     0x7f41521e3e70 __memcmp_avx2_movbe
    @          0x307c958 starrocks::BinaryColumnBase<>::compare_at()
    @          0x55efcfe starrocks::OlapTablePartitionParam::find_tablets()
    @          0x5609a14 starrocks::stream_load::OlapTableSink::send_chunk()
    @          0x4f037f8 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x4f058fd starrocks::PlanFragmentExecutor::open()
    @          0x4e5012b starrocks::FragmentExecState::execute()
    @          0x4e56903 starrocks::FragmentMgr::exec_actual()
    @          0x501f4b2 starrocks::ThreadPool::dispatch_thread()
    @          0x5019faa starrocks::Thread::supervise_thread()
    @     0x7f4152be517a start_thread
    @     0x7f4152186dc3 __GI___clone
    @                0x0 (unknown)
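
Every crash header in this post carries the crash time as a Unix timestamp, which is useful for lining the crash up with load or query logs. As a small illustration (the helper name `crash_time_utc` is made up here), the timestamp can be extracted and converted to UTC like this:

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the header line printed at the top of each stack trace.
ABORTED_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")


def crash_time_utc(line: str) -> Optional[datetime]:
    """Parse '*** Aborted at <ts> (unix time)'; return the UTC time or None."""
    m = ABORTED_RE.search(line)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    header = (
        '*** Aborted at 1689211640 (unix time) try "date -d @1689211640" '
        "if you are using GNU date ***"
    )
    print(crash_time_utc(header).isoformat())  # 2023-07-13T01:27:20+00:00
```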
  1. BE crash when a load is cancelled or times out

*** Aborted at 1689139416 (unix time) try "date -d @1689139416" if you are using GNU date ***
PC: @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
*** SIGSEGV (@0x2a0) received by PID 59421 (TID 0x7fa2a17c9700) from PID 672; stack trace: ***
    @          0x487d742 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa32f8cb630 (unknown)
    @          0x3c8965c starrocks::stream_load::NodeChannel::cancel()
    @          0x3c7abab starrocks::stream_load::OlapTableSink::cancel()
    @          0x36cce82 starrocks::PlanFragmentExecutor::cancel()
    @          0x363ce47 starrocks::FragmentMgr::cancel()
    @          0x363df7a starrocks::FragmentMgr::cancel_worker()
    @          0x6372d20 execute_native_thread_routine
    @     0x7fa32f8c3ea5 start_thread
    @     0x7fa32eede9fd __clone
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
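  • Affected versions:
    • 2.3.7 ~ 2.3.9
    • 2.4.3 ~ 2.4.4
  • Fixed versions:
    • 2.3.10+
    • 2.4.5+
  • Workaround:
  • Root cause:
    • The fast-cancel logic of loads running on the non-Pipeline engine is buggy (in 2.3 and 2.4, loads use the non-Pipeline engine).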
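
Checking a cluster version against ranges like the ones listed for this crash is easy to script. A minimal sketch, assuming plain `major.minor.patch` version strings; the function names are illustrative, and the ranges are copied from the list above.

```python
def parse_version(text: str) -> tuple:
    """Turn '2.3.8' into (2, 3, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version: str, ranges: list) -> bool:
    """True if `version` falls inside any inclusive (low, high) range."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi) for lo, hi in ranges)


# Affected ranges for the load cancel/timeout BE crash.
CANCEL_CRASH_RANGES = [("2.3.7", "2.3.9"), ("2.4.3", "2.4.4")]

if __name__ == "__main__":
    for ver in ["2.3.8", "2.3.10", "2.4.4", "2.4.5"]:
        state = "affected" if is_affected(ver, CANCEL_CRASH_RANGES) else "not affected"
        print(ver, state)
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.3.10" would sort before "2.3.9" and be misreported as affected.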
  1. BE disk usage grows abnormally and recovers after a restart
  1. Materialized view query rewrite error

2023-02-21 06:15:02,865 WARN (starrocks-mysql-nio-pool-462|16527) [StmtExecutor.execute():522] execute Exception, sql select * from xxx limit 11
java.lang.NullPointerException: null
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForPartitionedMv(MaterializedView.java:820) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:808) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getUpdatedPartitionNamesOfTable(MaterializedView.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.catalog.MaterializedView.getPartitionNamesToRefreshForMv(MaterializedView.java:798) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.MvRewritePreprocessor.prepareMvCandidatesForPlan(MvRewritePreprocessor.java:61) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.prepare(Optimizer.java:193) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:88) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:95) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:37) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:373) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:313) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:430) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:676) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]

  1. failed to call frontend service

BE error:

failed to call frontend service

FE error:

2023-07-23 06:49:00,883 WARN (thrift-server-accept|85) [ThreadPoolManager$LogDiscardPolicy.rejectedExecution():178] Task com.starrocks.common.SRTThreadPoolServer$WorkerProcess@5362a7df rejected from thrift-server-pool java.util.concurrent.ThreadPoolExecutor@6a6406a6[Running, pool size = 4096, active threads = 4096, queued tasks = 0, completed tasks = 4558444]
  • Github Issue:
  • Github Fix PR:
  • Jira
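  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ latest
    • 2.4.0 ~ latest
  • Fixed versions:
    • 2.5.0+
  • Workaround:
    • Set thrift_server_max_worker_threads=8192 in fe.conf (the default is 4096).
    • Lower the session variable parallel_fragment_exec_instance_num.
  • Root cause:
    • A Thrift thread pool issue; 2.5 includes dedicated optimizations for it.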
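
The rejection line in the FE log already shows whether the pool is saturated: in the sample above, `pool size` and `active threads` are both 4096. A small sketch for pulling those numbers out of FE log lines; the regular expression is written against that single sample line, and the function name is hypothetical.

```python
import re
from typing import Optional

POOL_RE = re.compile(
    r"rejected from (?P<pool>\S+) .*?"
    r"pool size = (?P<size>\d+), active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+)"
)


def parse_rejection(line: str) -> Optional[dict]:
    """Extract thread pool counters from a rejectedExecution log line."""
    m = POOL_RE.search(line)
    if m is None:
        return None
    size = int(m.group("size"))
    active = int(m.group("active"))
    return {
        "pool": m.group("pool"),
        "size": size,
        "active": active,
        "queued": int(m.group("queued")),
        # Every worker is busy, so new Thrift requests get rejected.
        "saturated": active >= size,
    }
```

When `saturated` is True for `thrift-server-pool`, the workarounds above apply.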
  1. Crash when using the replace function

*** Aborted at 1688752969 (unix time) try "date -d @1688752969" if you are using GNU date ***
PC: @     0x2baa64381387 __GI_raise
*** SIGABRT (@0x1229c) received by PID 74396 (TID 0x2bac0a06c700) from PID 74396; stack trace: ***
    @          0x596c182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2baa63a30630 (unknown)
    @     0x2baa64381387 __GI_raise
    @     0x2baa64382a78 __GI_abort
    @          0x2af5006 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x7dfef76 __cxxabiv1::__terminate()
    @          0x7dfefe1 std::terminate()
    @          0x7dff134 __cxa_throw
    @          0x2af6ce7 std::__throw_length_error()
    @          0x507c233 starrocks::vectorized::regexp_replace_use_hyperscan()
    @          0x50855d6 starrocks::vectorized::StringFunctions::regexp_replace()
    @          0x3e843c7 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x386b7c7 starrocks::ExprContext::evaluate()
    @          0x3e83f9c starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x386b7c7 starrocks::ExprContext::evaluate()
    @          0x3e83f9c starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
    @          0x3e4e60f starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x3e4e646 starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x3e4e646 starrocks::vectorized::VectorizedIfExpr<>::evaluate()
    @          0x386b85e starrocks::ExprContext::evaluate()
    @          0x2ebde44 starrocks::vectorized::ProjectNode::get_next()
    @          0x4891463 starrocks::PlanFragmentExecutor::_get_next_internal_vectorized()
    @          0x4891850 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
    @          0x48939ed starrocks::PlanFragmentExecutor::open()
    @          0x47e4f4b starrocks::FragmentExecState::execute()
    @          0x47eb1d3 starrocks::FragmentMgr::exec_actual()
    @          0x49888b2 starrocks::ThreadPool::dispatch_thread()
    @          0x49833aa starrocks::Thread::supervise_thread()
    @     0x2baa63a28ea5 start_thread
    @     0x2baa6444996d __clone
    @                0x0 (unknown)
  1. Thrift RPC allocates a large amount of memory

W0726 14:07:03.697324 16174 mem_hook.cpp:247] large memory alloc: 1347571780 bytes, stack:
    @          0x31a4d83  malloc
    @          0x8191535  operator new()
    @          0x27b6d7e  std::__cxx11::basic_string<>::_M_mutate()
    @          0x30af6b7  apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
    @          0x30af84c  apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
    @          0x3318599  apache::thrift::TDispatchProcessor::process()
    @          0x5f0a058  apache::thrift::server::TConnectedClient::run()
    @          0x5f02554  apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
    @          0x5f04d5d  apache::thrift::concurrency::Thread::threadMain()
    @          0x5eea4c6  std::thread::_State_impl<>::_M_run()
    @          0x820a430  execute_native_thread_routine
    @     0x7f77e8ebeea5  start_thread
    @     0x7f77e84d9b0d  __clone
    @              (nil)  (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.2.0 ~ latest
    • 2.3.0 ~ 2.3.14
    • 2.4.0 ~ latest
    • 2.5.0 ~ 2.5.9
    • 3.0.0 ~ 3.0.4
  • Fixed versions:
    • 2.2: not fixed
    • 2.3.15+
    • 2.4: not fixed
    • 2.5.10+
    • 3.0.5+
  • Workaround:
  • Root cause:
  1. Duplicate RPC causes use-after-free

*** Aborted at 1689305620 (unix time) try "date -d @1689305620" if you are using GNU date ***
PC: @     0x7f548e0ace1f (unknown)
*** SIGABRT (@0x1300e) received by PID 77838 (TID 0x7f53e3e026c0) from PID 77838; stack trace: ***
    @          0x6240182 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f548e060fc0 (unknown)
    @     0x7f548e0ace1f (unknown)
    @     0x7f548e060f16 gsignal
    @     0x7f548e04c47f abort
    @          0x2e62ca8 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x89c1436 __cxxabiv1::__terminate()
    @          0x89c14a1 std::terminate()
    @          0x89c1b9f __cxa_pure_virtual
    @          0x56ff27b starrocks::pipeline::PipelineDriverPoller::run_internal()
    @          0x5065b1a starrocks::Thread::supervise_thread()
    @     0x7f548e0ab32a (unknown)
    @     0x7f548e129a60 (unknown)
    @                0x0 (unknown)
  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.3.0 ~ latest
    • 2.4.0 ~ latest
    • 2.5.0 ~ 2.5.9
    • 3.0.0 ~ 3.0.4
  • Fixed versions:
    • 2.3: not fixed
    • 2.4: not fixed
    • 2.5.10+
    • 3.0.5+
  • Workaround:
  • Root cause:
  1. FE metadata directory bloat

  • Github Issue:
  • Github Fix PR:
  • Jira
  • Affected versions:
    • 2.0.4 ~ 2.0.7
    • 2.1.5 ~ 2.1.10
    • 2.2.0 ~ 2.2.2
  • Fixed versions:
    • 2.0.8+
    • 2.1.11+
    • 2.2.3+
  • Workaround:
    • Replace starrocks-bdb-je-7.3.8.jar in the fe/lib directory with http://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/je-7.3.7.jar and restart the FE.
  • Root cause:
  1. Primary Key table compaction crash

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1667573352 (unix time) try "date -d @1667573352" if you are using GNU date ***
PC: @          0x1e72eb0 starrocks::TabletUpdates::_apply_compaction_commit()
*** SIGSEGV (@0x0) received by PID 40683 (TID 0x7efd5f069700) from PID 0; stack trace: ***
    @          0x4820332 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7efddbe1e630 (unknown)
    @          0x1e72eb0 starrocks::TabletUpdates::_apply_compaction_commit()
    @          0x1e7425d starrocks::TabletUpdates::do_apply()
    @          0x2681635 starrocks::ThreadPool::dispatch_thread()
    @          0x267ca6a starrocks::Thread::supervise_thread()
    @     0x7efddbe16ea5 start_thread
    @     0x7efddb431b0d __clone
    @                0x0 (unknown)
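  1. FE follower memory leak

INSERT, INSERT INTO SELECT, or a materialized view refresh causes a memory leak on FE followers.

grep LoadLabelCleaner fe.log*

If there is no log output, or the most recent output is very old, the issue has already been triggered.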
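
The same check can be scripted. The sketch below assumes FE log lines start with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as in the FE log excerpts earlier in this post. The one-hour threshold and the function names are arbitrary choices for illustration, not values taken from StarRocks.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def last_cleaner_run(lines: Iterable[str]) -> Optional[datetime]:
    """Return the newest timestamp among lines mentioning LoadLabelCleaner."""
    latest = None
    for line in lines:
        if "LoadLabelCleaner" not in line:
            continue
        try:
            # The first 23 characters hold e.g. "2023-07-23 06:49:00,883".
            ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # Not a line with a leading timestamp; skip it.
        if latest is None or ts > latest:
            latest = ts
    return latest


def cleaner_looks_stuck(
    lines: Iterable[str],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> bool:
    """True if LoadLabelCleaner never logged, or its last log is too old."""
    last = last_cleaner_run(lines)
    return last is None or now - last > max_age
```

Feeding it the lines of fe.log* together with the current time reproduces the manual rule: no output, or only old output, means the problem has been hit.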

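  1. Primary Key table: tablet state is not persisted after a schema change, so after a restart compaction is never triggered, which leads to Too many versions

After locating the tablet with show tablet and fetching its meta with curl, the tablet is found to stay in the NOT_READY state:

"tablet_state": "PB_NOTREADY",

From 2.5 onward, the following SQL shows which tablets are affected: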

select be_id, state, count(*) from information_schema.be_tablets group by be_id, state;
73474502        NOTREADY        130
admin execute on 10004 '
for (info in StorageEngine.get_tablet_infos(xxx, yyy)) {
    if (info.state == 0) {
        var t = StorageEngine.get_tablet(info.tablet_id)
        if (t != null) {
            t.set_tablet_state_as_int(0)
            t.save_meta()
            System.print("fix table %(info.table_id) tablet %(info.tablet_id)")
        }
    }
}
';

xxx: table_id

yyy: partition_id

  1. Aggregation convert_hash_set_to_chunk crash

query_id:8b0c470a-245d-11ee-873d-00163e0782a2, fragment_instance:8b0c470a-245d-11ee-873d-00163e0782b7
*** Aborted at 1689569467 (unix time) try "date -d @1689569467" if you are using GNU date ***
PC: @ 0x25f3304 starrocks::vectorized::NullableColumn::deserialize_and_append_batch()
*** SIGSEGV (@0x0) received by PID 3038 (TID 0x7f73c94e9700) from PID 0; stack trace: ***
 @ 0x3f91c22 google::(anonymous namespace)::FailureSignalHandler()
 @ 0x7f744a943235 os::Linux::chained_handler()
 @ 0x7f744a948031 JVM_handle_linux_signal
 @ 0x7f744a93b0c8 signalHandler()
 @ 0x7f7449df2630 (unknown)
 @ 0x25f3304 starrocks::vectorized::NullableColumn::deserialize_and_append_batch()
 @ 0x26d00e3 starrocks::Aggregator::convert_hash_set_to_chunk<>()
 @ 0x2a0e3b3 starrocks::pipeline::AggregateDistinctBlockingSourceOperator::pull_chunk()
 @ 0x29e0633 starrocks::pipeline::PipelineDriver::process()
 @ 0x29d6cde starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
 @ 0x2236199 starrocks::ThreadPool::dispatch_thread()
 @ 0x2231d4a starrocks::Thread::supervise_thread()
 @ 0x7f7449deaea5 start_thread
 @ 0x7f7449405b0d __clone
 @ 0x0 (unknown)
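  1. Java UDTF memory leak

  1. Primary Key table: after loading data, the result of select count(*) from xxx; fluctuates between runs

Writes to the Primary Key table leave the replicas with inconsistent data.

  1. Primary Key table crash when cleaning up expired rowsets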

*** Aborted at 1655286911 (unix time) try "date -d @1655286911" if you are using GNU date ***
PC: @          0x1a6f2cc starrocks::TabletUpdates::_debug_version_info()
*** SIGSEGV (@0x0) received by PID 5538 (TID 0x7ff9db8a8700) from PID 0; stack trace: ***
    @          0x3f6fad2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7ffa892a0630 (unknown)
    @          0x1a6f2cc starrocks::TabletUpdates::_debug_version_info()
    @          0x1a79d23 starrocks::TabletUpdates::remove_expired_versions()
    @          0x1a43950 starrocks::TabletManager::start_trash_sweep()
    @          0x1a15267 starrocks::StorageEngine::_start_trash_sweep()
    @          0x1bf4e79 starrocks::StorageEngine::_garbage_sweeper_thread_callback()
    @          0x59ed4d0 execute_native_thread_routine
    @     0x7ffa89298ea5 start_thread
    @     0x7ffa888b38dd __clone
    @                0x0 (unknown)