Common Crash / Bug / Optimization Lookup

  1. Queries hang when resource groups are used

pstack shows the following stack:

pstack <PID of starrocks_be>

#0  0x00007fe5bc2cca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000005a229bc in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x37cf64bf8) at /var/local/gcc/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::condition_variable::wait (this=this@entry=0x37cf64bf8, __lock=...) at ../../../.././libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
#5  0x00000000028f0305 in starrocks::pipeline::GlobalDriverExecutor::_worker_thread (this=0xa892ee0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_executor.cpp:86
#6  0x000000000217fef9 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:248
#7  starrocks::FunctionRunnable::run (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:44
#8  starrocks::ThreadPool::dispatch_thread (this=0x19a50000) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/threadpool.cpp:513
#9  0x000000000217baaa in std::function<void ()>::operator()() const (this=0x17fa08d8) at /usr/include/c++/10.3.0/bits/std_function.h:248
#10 starrocks::Thread::supervise_thread (arg=0x17fa08c0) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/util/thread.cpp:326
#11 0x00007fe5bc2c8ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fe5bb8e3b0d in clone () from /lib64/libc.so.6

The same thread has two take frames at once:

#3  0x00000000028f3367 in starrocks::pipeline::QuerySharedDriverQueue::take (this=0x37cf64400) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:95
#4  0x00000000028f3d22 in starrocks::pipeline::WorkGroupDriverQueue::take (this=<optimized out>) at /gaia/workspace-job/git.xiaojukeji.com/datainfra-hadoop/didi-starrock/be/src/exec/pipeline/pipeline_driver_queue.cpp:244
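
To spot this pattern quickly in a large pstack dump, a minimal sketch follows. It assumes the usual pstack layout in which each thread section starts with a line beginning with "Thread " and frames start with "#N". The helper names (split_threads, count_nested_take) are hypothetical and not part of StarRocks. Idle worker threads can also wait in take, so the count only helps locate the stacks; it does not prove a hang on its own.

```python
import re

# Frame names taken from the stack shown above.
INNER = "starrocks::pipeline::QuerySharedDriverQueue::take"
OUTER = "starrocks::pipeline::WorkGroupDriverQueue::take"


def split_threads(pstack_text):
    """Split pstack output into per-thread lists of frame lines."""
    threads, current = [], []
    for line in pstack_text.splitlines():
        if line.startswith("Thread "):
            # A new thread section begins; flush the previous one.
            if current:
                threads.append(current)
            current = []
        elif re.match(r"#\d+\s", line):
            current.append(line)
    if current:
        threads.append(current)
    return threads


def count_nested_take(pstack_text):
    """Count threads where the inner take frame sits directly above the outer one."""
    hits = 0
    for frames in split_threads(pstack_text):
        for upper, lower in zip(frames, frames[1:]):
            if INNER in upper and OUTER in lower:
                hits += 1
                break
    return hits
```

Usage: save the pstack output to a file, read it as text, and pass it to count_nested_take.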
  1. StatisticsCollectJob on table _statistics.column_statistics fails with Too many versions

2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateFragmentExecStatus():2174] one instance report fail errorCode SERVICE_UNAVAILABLE Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26, query_id=3772d178-8ca4-11ed-854d-6cfe54388271 instance_id=3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,173 WARN (thrift-server-pool-39|12567) [Coordinator.updateStatus():1249] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: 3772d178-8ca4-11ed-854d-6cfe54388271, instance id: 3772d178-8ca4-11ed-854d-6cfe54388275
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1338] insert failed: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
2023-01-05 10:54:05,174 WARN (AutoStatistic|38) [StmtExecutor.handleDMLStmt():1415] handle insert stmt fail: insert_3772d178-8ca4-11ed-854d-6cfe54388271
com.starrocks.common.DdlException: Too many versions. tablet_id: 10226, version_count: 1001, limit: 1000: be:XXX.XXX.XXX.26
        at com.starrocks.common.ErrorReport.reportDdlException(ErrorReport.java:80) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.handleDMLStmt(StmtExecutor.java:1339) [starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:471) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticsCollectJob.collectStatisticSync(StatisticsCollectJob.java:92) [starrocks-fe.jar:?]
        at com.starrocks.statistic.FullStatisticsCollectJob.collect(FullStatisticsCollectJob.java:62) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticExecutor.collectStatistics(StatisticExecutor.java:190) [starrocks-fe.jar:?]
        at com.starrocks.statistic.StatisticAutoCollector.runAfterCatalogReady(StatisticAutoCollector.java:61) [starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) [starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
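
To find out which tablets hit the limit, the FE log can be scanned for the message above. This is a rough sketch that relies only on the message format shown in these log lines; the function name find_version_overflows is hypothetical.

```python
import re

# Matches the message format seen in the FE log above.
PATTERN = re.compile(
    r"Too many versions\. tablet_id: (\d+), version_count: (\d+), limit: (\d+)"
)


def find_version_overflows(log_text):
    """Return {tablet_id: (highest version_count seen, limit)} from FE log text."""
    result = {}
    for match in PATTERN.finditer(log_text):
        tablet_id, count, limit = (int(group) for group in match.groups())
        previous = result.get(tablet_id)
        if previous is None or count > previous[0]:
            result[tablet_id] = (count, limit)
    return result
```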
  1. INSERT memory leak (INSERT or INSERT INTO SELECT)

The FE Follower leaks memory while the Leader is normal. The memory distribution shows that TxnStateCallbackFactory uses a large amount of memory.

jmap -histo pid


 num     #instances         #bytes  class name
----------------------------------------------
   1:      65039949     6979006048  [C
   2:       4022619     2525925768  [B
   3:      51632292     2478350016  java.util.HashMap
   4:      73877356     1773056544  java.lang.String
   5:      10172354     1546197808  com.starrocks.load.loadv2.InsertLoadJob
   6:      20355243      977051664  com.google.gson.internal.LinkedTreeMap$Node
   7:      20352822      976935456  com.google.gson.internal.LinkedTreeMap
   8:      10727986      935362088  [Ljava.util.HashMap$Node;
   9:      38312936      919510464  java.lang.Long
  10:      10172713      813817488  [Lorg.apache.commons.collections.map.AbstractHashedMap$HashEntry;
  11:      22247256      711912192  java.util.HashMap$Node
  12:      10230960      654781440  java.util.concurrent.ConcurrentHashMap
  13:      10461293      585832408  java.util.LinkedHashMap
  14:      10172712      569671872  com.starrocks.load.EtlStatus
  15:      10172712      569671872  org.apache.commons.collections.map.HashedMap
  16:      10172400      488275200  java.util.concurrent.locks.ReentrantReadWriteLock$FairSync

If com.starrocks.load.loadv2.InsertLoadJob takes up a large share of the histogram, as it does here, this is the issue.
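
To put a number on "a large share", the histogram can be parsed and the byte share of one class computed. This is a minimal sketch that assumes the jmap -histo row layout shown above (rank, instance count, bytes, class name); parse_histo and class_share are hypothetical helper names. The share is relative to the rows passed in, not to the whole heap.

```python
import re

# One histogram row: "  5:  10172354  1546197808  com.starrocks...InsertLoadJob"
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Return a list of (class_name, instances, bytes) from jmap -histo text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match.group(4), int(match.group(2)), int(match.group(3))))
    return rows


def class_share(text, class_name):
    """Fraction of the listed bytes that belongs to class_name (0.0 if nothing parsed)."""
    rows = parse_histo(text)
    total = sum(size for _, _, size in rows)
    if total == 0:
        return 0.0
    matched = sum(size for name, _, size in rows if name == class_name)
    return matched / total
```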

  1. Join Reorder Crash

*** Aborted at 1669284722 (unix time) try "date -d @1669284722" if you are using GNU date ***
PC: @          0x319a237 starrocks::serde::ColumnArraySerde::deserialize()
*** SIGSEGV (@0x0) received by PID 24275 (TID 0x7f00a6887700) from PID 0; stack trace: ***
    @          0x3cf85d2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f014b602600 (unknown)
    @          0x319a237 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x319c793 starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x1ee0a78 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x1ee1b6b starrocks::DataStreamRecvr::SenderQueue::add_chunks()
    @          0x1ee3a79 starrocks::DataStreamRecvr::add_chunks()
    @          0x1ec3e3d starrocks::DataStreamMgr::transmit_chunk()
    @          0x1f07abc starrocks::PInternalServiceImpl<>::transmit_chunk()
    @          0x3e2a32e brpc::policy::ProcessRpcRequest()
    @          0x3e20d97 brpc::ProcessInputMessage()
    @          0x3e21c43 brpc::InputMessenger::OnNewMessages()
    @          0x3ec890e brpc::Socket::ProcessEvent()
    @          0x3dd689f bthread::TaskGroup::task_runner()
    @          0x3f5f081 bthread_make_fcontext
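
The crash header carries a unix timestamp, which helps when matching the crash against FE logs. Where GNU date is not at hand, a small sketch like the following does the same conversion; it assumes only the "*** Aborted at N (unix time)" header format shown above, and crash_time_utc is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Header line printed at the top of the crash dump.
ABORTED = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")


def crash_time_utc(log_text):
    """Return the crash time as an aware UTC datetime, or None if no header is found."""
    match = ABORTED.search(log_text)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
```

For the dump above, crash_time_utc returns 2022-11-24 10:12:02 UTC.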
  1. Schema change on Primary Key / Unique / Aggregate tables does not support ARRAY

mysql> ALTER TABLE test_add_array_column
    -> ADD COLUMN arr2 ARRAY< varchar (65533)>;
ERROR 1064 (HY000): Unexpected exception: ARRAY<VARCHAR(65533)> must be used in DUP_KEYS
  1. Crash caused by missing AVX2 support

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1673246592 (unix time) try "date -d @1673246592" if you are using GNU date ***
PC: @ 0x4ab69a4 bitset_container_from_array
*** SIGILL (@0x4ab69a4) received by PID 9331 (TID 0x7f72ef254700) from PID 78342564; stack trace: ***
@ 0x4659ee2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f7309f71630 (unknown)
@ 0x4ab69a4 bitset_container_from_array
@ 0x4a9cc9d roaring_bitmap_add_many
@ 0x2fdd52e starrocks::DelVector::_add_dels()
@ 0x2fddcdc starrocks::DelVector::add_dels_as_new_version()
@ 0x2e5544c starrocks::TabletUpdates::_apply_rowset_commit()
@ 0x2e57482 starrocks::TabletUpdates::do_apply()
@ 0x362cab5 starrocks::ThreadPool::dispatch_thread()
@ 0x3627efa starrocks::thread::supervise_thread()
@ 0x7f7309f69ea5 start_thread
@ 0x7f7309584b0d __clone
@ 0x0 (unknown)
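  • Cause
    • SIGILL usually means the machine hosting the BE does not support the AVX2 instruction set
  • Fix
    • Move to a machine that supports the AVX2 instruction set; check with: cat /proc/cpuinfo | grep avx2
    • Disable AVX2 support and build the BE manually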
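
The cpuinfo check above can also be scripted, for example to screen a batch of hosts before deploying a BE. A minimal sketch, assuming x86 Linux where /proc/cpuinfo lists CPU features on lines whose key is "flags"; has_avx2 and local_machine_has_avx2 are hypothetical helper names.

```python
def has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that e.g. 'avx' alone does not match.
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def local_machine_has_avx2(path="/proc/cpuinfo"):
    """Read the local cpuinfo file (Linux only) and check it."""
    with open(path, encoding="utf-8") as handle:
        return has_avx2(handle.read())
```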

It crashes at the slightest provocation; this level of stability is worrying!

  1. Checksum mismatch error

Bad page: checksum mismatch (actual=243080401 vs expect=12)
  • Cause
    • Usually a disk hardware problem. Check whether dmesg -T reports I/O errors (look for: I/O error)
[Sat Jan 14 21:30:54 2023] nvme1n1: Write(0x1) @ LBA 174796784, 1016 blocks, Data Transfer Error (sct 0x0 / sc 0x4) DNR 
[Sat Jan 14 21:30:54 2023] blk_update_request: critical target error, dev nvme1n1, sector 174796784 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
[Sat Jan 14 21:30:54 2023] EXT4-fs warning (device dm-0): ext4_end_bio:325: I/O error 5 writing to inode 216269336 (offset 8388608 size 8388608 starting block 21849088)
[Sat Jan 14 21:30:54 2023] buffer_io_error: 502 callbacks suppressed
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849088
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849089
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849090
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849091
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849092
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849093
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849094
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849095
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849096
[Sat Jan 14 21:30:54 2023] Buffer I/O error on device dm-0, logical block 21849097
[Sat Jan 14 21:30:54 2023] JBD2: Detected IO errors while flushing file data on dm-0-8
  • Solution
    • Replace the disk
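
To see which device needs replacing, the dmesg output can be grouped by device. This is a rough sketch based only on the two kernel message shapes that appear in the dmesg excerpt above ("Buffer I/O error on device ..." and "blk_update_request: ... error, dev ..."); other kernels may word these differently, and io_errors_by_device is a hypothetical helper name.

```python
import re
from collections import Counter

# Message shapes copied from the dmesg excerpt above.
BUFFER_ERR = re.compile(r"Buffer I/O error on device (\S+),")
BLK_ERR = re.compile(r"blk_update_request: .*?error, dev (\S+),")


def io_errors_by_device(dmesg_text):
    """Count I/O error lines per device name in dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for pattern in (BUFFER_ERR, BLK_ERR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Usage: pass the text of dmesg -T to io_errors_by_device and look at the devices with non-zero counts.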
  1. actual row size changed after compaction

W0131 10:10:09.796995 33398 task_worker_pool.cpp:1157] clone failed. signature:2616201
W0131 10:18:19.914609 33340 tablet_updates.cpp:1460] remove_expired_versions failed, tablet updates is in error state: tablet:2616201 actual row size changed after compaction 1697323 -> 1779662 tablet:2616201 #version:2 [58281 58281.181 58281.1] pending:rowsets:4
W0131 10:19:14.006429 33395 engine_clone_task.cpp:145] Fail to load snapshot:Internal error:load snapshot failed, tablet updates is in error state: tablet:2616201 actual row size changed after compaction 1697323 -> 1779562 tablet:2616201 #version:2 [58281 58281.101 58281.1] pending:rowsets:4
  1. Unable to reset the root password

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading authorization packet', system error: 0
  1. Asynchronous materialized view memory leak

jmap -histo:live <FE PID>

 num     #instances         #bytes  class name
----------------------------------------------
   1:      52869565     2967542312  [C
   2:      83179155     2661732960  java.util.concurrent.ConcurrentHashMap$Node
   3:      15120953     1541269008  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   4:      52880778     1269138672  java.lang.String
   5:        232083     1153473936  [B
   6:      16837970     1077630080  java.util.concurrent.ConcurrentHashMap
   7:      37361372      896672928  com.starrocks.common.Pair
   8:      35647187      855532488  com.starrocks.common.util.Counter
   9:       9578434      473886784  [Ljava.lang.Object;
  10:      12169451      292066824  java.util.ArrayList
  11:      11814272      283542528  java.util.Collections$SetFromMap
  12:      11814043      283537032  java.util.concurrent.ConcurrentHashMap$KeySetView
  13:       1899199      167129512  com.starrocks.analysis.SlotRef
  14:        981845      141385680  com.starrocks.thrift.TExprNode
  15:       2864417      137492016  java.util.HashMap
  16:       5573428      133762272  java.lang.Long
  17:       2189413      122607128  java.util.LinkedHashMap
  18:       1406488      120784800  [Ljava.util.HashMap$Node;
  19:       3061632       97972224  java.util.HashMap$Node
  20:       1653441       79365168  com.starrocks.common.util.RuntimeProfile
  21:       1707982       68319280  java.util.LinkedHashMap$Entry
  22:       2764850       66356400  com.starrocks.thrift.TNetworkAddress
  23:        499248       63903744  com.starrocks.catalog.Replica
  24:       1955990       62591680  com.starrocks.thrift.TScalarType
  25:       1714766       54872512  java.util.concurrent.locks.ReentrantLock$NonfairSync
  26:       1653505       52912160  java.util.Collections$SynchronizedMap
  27:       1635130       52324160  com.starrocks.sql.analyzer.Field
  1. SET TRANSACTION ISOLATION fails

syntax to use near 'ISOLATION'

2023-02-03 10:50:31,865 WARN (starrocks-mysql-nio-pool-330|36506) [ConnectProcessor.handleQuery():334] Process one query failed because. com.starrocks.common.AnalysisException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'ISOLATION' at line 1
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:290) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:430) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:676) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_231]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_231]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
  1. SELECT with LIMIT fails with java.lang.NullPointerException: null

2023-01-13 14:18:46,703 WARN (starrocks-mysql-nio-pool-25|285) [StmtExecutor.execute():524] execute Exception, sql select * from strock_ads_sg_bqbb_brand_detial_df limit 10

java.lang.NullPointerException: null
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalOlapScan(PlanFragmentBuilder.java:539) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalOlapScan(PlanFragmentBuilder.java:229) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalOlapScanOperator.accept(PhysicalOlapScanOperator.java:132) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:238) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDistribution(PlanFragmentBuilder.java:1389) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDistribution(PlanFragmentBuilder.java:229) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalDistributionOperator.accept(PhysicalDistributionOperator.java:44) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:238) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalLimit(PlanFragmentBuilder.java:2301) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalLimit(PlanFragmentBuilder.java:229) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalLimitOperator.accept(PhysicalLimitOperator.java:33) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:238) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder.createPhysicalPlan(PlanFragmentBuilder.java:163) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:115) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:65) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:39) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:373) ~[starrocks-fe.jar:?]
  1. Broker Load fails with: mismatched row count

Type:LOAD_RUN_FAIL; msg:mismatched row count: 512 vs 4096
*** Aborted at 1667808981 (unix time) try "date -d @1667808981" if you are using GNU date ***
PC: @          0x24014a3 strings::memcpy_inlined()
*** SIGSEGV (@0x0) received by PID 168026 (TID 0x7f3c1c420700) from PID 0; stack trace: ***
    @          0x507f842 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f3e683c4630 (unknown)
    @          0x24014a3 strings::memcpy_inlined()
    @          0x361cee3 starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2c30b4f starrocks::vectorized::ColumnFactory<>::accept_mutable()
    @          0x361ddc8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x46ec15f starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x3e3be18 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x3e4257c starrocks::DataStreamRecvr::NonPipelineSenderQueue::add_chunks<>()
    @          0x3e3c282 starrocks::DataStreamRecvr::NonPipelineSenderQueue::add_chunks()
    @          0x3dbf073 starrocks::DataStreamRecvr::add_chunks()
    @          0x3d604b6 starrocks::DataStreamMgr::transmit_chunk()
    @          0x4729c3c starrocks::PInternalServiceImplBase<>::transmit_chunk()
    @          0x51b115e brpc::policy::ProcessRpcRequest()
    @          0x51a7b67 brpc::ProcessInputMessage()
    @          0x51a8a13 brpc::InputMessenger::OnNewMessages()
    @          0x524f75e brpc::Socket::ProcessEvent()
    @          0x515d6af bthread::TaskGroup::task_runner()
    @          0x52e60a1 bthread_make_fcontext
  1. Crash during JSON load

*** Aborted at 1667192760 (unix time) try "date -d @1667192760" if you are using GNU date ***
PC: @          0x27460a1 starrocks::vectorized::JsonDocumentStreamParser::get_current()
*** SIGSEGV (@0x8) received by PID 12653 (TID 0x7fa1a94c1700) from PID 8; stack trace: ***
    @          0x3fa3ad2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa2a1087630 (unknown)
    @          0x27460a1 starrocks::vectorized::JsonDocumentStreamParser::get_current()
    @          0x27455d7 starrocks::vectorized::JsonReader::_read_rows<>()
    @          0x27414d9 starrocks::vectorized::JsonReader::read_chunk()
    @          0x27416ec starrocks::vectorized::JsonScanner::get_next()
    @          0x272e5e0 starrocks::vectorized::FileScanNode::_scanner_scan()
    @          0x272ff4f starrocks::vectorized::FileScanNode::_scanner_worker()
    @          0x5a21410 execute_native_thread_routine
    @     0x7fa2a107fea5 start_thread
    @     0x7fa2a069ab0d __clone
  1. Query error or crash with View + Union + NULL

Mismatched row count

*** SIGSEGV (@0x0) received by PID 3659 (TID 0x7f17de2fb700) from PID 0; stack trace: ***
    @          0x3ff4972 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f186229a630 (unknown)
    @     0x7f186190bc00 __memmove_ssse3_back
    @          0x1a26464 starrocks::vectorized::FixedLengthColumnBase<>::append()
    @          0x25224ca starrocks::vectorized::NullableColumn::append()
    @          0x251009b starrocks::vectorized::Chunk::append_safe()
    @          0x27453a7 starrocks::vectorized::ChunksSorterHeapSort::done()
    @          0x27419e5 starrocks::vectorized::ChunksSorter::finish()
    @          0x28ba860 starrocks::pipeline::PartitionSortSinkOperator::set_finishing()
    @          0x28def07 starrocks::pipeline::PipelineDriver::_mark_operator_finishing()
    @          0x28dff3b starrocks::pipeline::PipelineDriver::process()
    @          0x28d67dc starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x21772c9 starrocks::ThreadPool::dispatch_thread()
    @          0x2172e7a starrocks::Thread::supervise_thread()
    @     0x7f1862292ea5 start_thread
    @     0x7f18618ad9fd __clone
    @                0x0 (unknown)
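Conditions that trigger the bug:
1: The query reads from a view.
2: The view contains a UNION.
3: One of the UNION's children contains a constant NULL.
4: That constant NULL is in the first child of the UNION.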
  1. bitmap_contains consumes a large amount of memory

terminate called after throwing an instance of 'query_id:b1e35703-a6de-11ed-adfa-78ac4489cf40, fragment_instance:b1e35703-a6de-11ed-adfa-78ac4489cf47
*** Aborted at 1675771249 (unix time) try "date -d @1675771249" if you are using GNU date ***
std::runtime_error'
  what():  failed memory alloc in constructor
PC: @     0x7fe1f947e387 __GI_raise
*** SIGABRT (@0xce40004d5a7) received by PID 316839 (TID 0x7fe13807f700) from PID 316839; stack trace: ***
    @          0x40e1c82 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fe1f9f33630 (unknown)
    @     0x7fe1f947e387 __GI_raise
    @     0x7fe1f947fa78 __GI_abort
    @          0x5ae6dd2 __gnu_cxx::__verbose_terminate_handler()
    @          0x5ae5886 __cxxabiv1::__terminate()
    @          0x5ae58f1 std::terminate()
    @          0x5ae5a96 __cxa_rethrow
    @          0x1653704 _ZNSt8_Rb_treeIjSt4pairIKj7RoaringESt10_Select1stIS3_ESt4lessIjESaIS3_EE7_M_copyINS9_11_Alloc_nodeEEEPSt13_Rb_tree_nodeIS3_EPKSD_PSt18_Rb_tree_node_baseRT_.isra.0.cold
    @          0x212826b starrocks::BitmapValue::BitmapValue()
    @          0x258608b starrocks::vectorized::ObjectColumn<>::append()
    @          0x2586412 starrocks::vectorized::ObjectColumn<>::append_value_multiple_times()
    @          0x293d750 starrocks::pipeline::CrossJoinLeftOperator::_copy_joined_rows_with_index_base_build()
    @          0x293dfc2 starrocks::pipeline::CrossJoinLeftOperator::pull_chunk()
    @          0x2965983 starrocks::pipeline::PipelineDriver::process()
    @          0x295bfb6 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x21c59f9 starrocks::ThreadPool::dispatch_thread()
    @          0x21c15aa starrocks::Thread::supervise_thread()
    @     0x7fe1f9f2bea5 start_thread
    @     0x7fe1f9546b0d __clone
    @                0x0 (unknown)
  1. SQL parse error: location

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'location' at line 3
  1. Low-cardinality optimization causes a plan rewrite Unknown error
java.lang.IllegalStateException: null
        at com.google.common.base.Preconditions.checkState(Preconditions.java:494) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.sql.plan.ScalarOperatorToExpr$Formatter.visitVariableReference(ScalarOperatorToExpr.java:133) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.ScalarOperatorToExpr$Formatter.visitVariableReference(ScalarOperatorToExpr.java:112) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.scalar.ColumnRefOperator.accept(ColumnRefOperator.java:110) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.ScalarOperatorToExpr.buildExecExpression(ScalarOperatorToExpr.java:79) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.buildPartialTopNFragment(PlanFragmentBuilder.java:1749) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalTopN(PlanFragmentBuilder.java:1664) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalTopN(PlanFragmentBuilder.java:255) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalTopNOperator.accept(PhysicalTopNOperator.java:113) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:264) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDecode(PlanFragmentBuilder.java:474) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visitPhysicalDecode(PlanFragmentBuilder.java:255) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalDecodeOperator.accept(PhysicalDecodeOperator.java:112) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder$PhysicalPlanTranslator.visit(PlanFragmentBuilder.java:264) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.plan.PlanFragmentBuilder.createPhysicalPlan(PlanFragmentBuilder.java:169) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:110) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:66) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:37) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:373) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:313) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:430) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:676) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
  • Github Issue:

  • Github Fix PR:
  • Jira:

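  • Affected versions:
    • 2.4.0 ~ 2.4.3
    • 2.5.0 ~ 2.5.1
  • Fixed versions:
    • 2.4.4+
    • 2.5.2+
  • Temporary workaround:
    • set global cbo_enable_low_cardinality_optimize=false; (may hurt the performance of some SQL)
  • Cause:
    • A bug in plan rewriting for low-cardinality queries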
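
The version ranges listed for this issue can be turned into a quick check, for example when auditing several clusters. A minimal sketch, assuming plain three-part version strings such as 2.5.1 (suffixes like release-candidate tags are not handled); AFFECTED, parse_version and is_affected are hypothetical names.

```python
# Inclusive affected ranges copied from the list for this issue.
AFFECTED = [((2, 4, 0), (2, 4, 3)), ((2, 5, 0), (2, 5, 1))]


def parse_version(text):
    """Turn '2.5.1' into (2, 5, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text):
    """Return True if the version falls inside one of the affected ranges."""
    version = parse_version(version_text)
    return any(low <= version <= high for low, high in AFFECTED)
```

For instance, is_affected("2.4.3") is True, while is_affected("2.4.4") and is_affected("2.5.2") are False.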
  1. The same SQL is much slower with LIMIT than without
    Trigger conditions:
  • The LIMIT is small
  • There is a filter condition
  • Few rows remain after filtering
mysql> select lo_custkey from lineorder_flat where lo_revenue = 3322363 and lo_custkey = 2684693 limit 1;
+------------+
| lo_custkey |
+------------+
|    2684693 |
+------------+
1 row in set (13.25 sec)

mysql> select lo_custkey from lineorder_flat where lo_revenue = 3322363 and lo_custkey = 2684693;
+------------+
| lo_custkey |
+------------+
|    2684693 |
+------------+
1 row in set (0.08 sec)