Common Crash / Bug / Optimization Lookup

  1. FE CPU is saturated and many queries time out

jstack shows the following stack:

"starrocks-mysql-nio-pool-2791" #119533 daemon prio=5 os_prio=0 tid=0x00007fe65c060800 nid=0x19244 runnable [0x00007fe629881000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.hashCode(Arrays.java:4146)
        at java.util.Objects.hash(Objects.java:128)
        at com.starrocks.sql.optimizer.base.DistributionCol.hashCode(DistributionCol.java:116)
        at java.util.HashMap.hash(HashMap.java:340)
        at java.util.HashMap.get(HashMap.java:558)
        at com.starrocks.sql.optimizer.base.DistributionDisjointSet.find(DistributionDisjointSet.java:59)
        at com.starrocks.sql.optimizer.base.DistributionDisjointSet.union(DistributionDisjointSet.java:73)
        at com.starrocks.sql.optimizer.base.DistributionSpec$PropertyInfo.unionNullRelaxCols(DistributionSpec.java:98)
        at com.starrocks.sql.optimizer.OutputPropertyDeriver.computeHashJoinDistributionPropertyInfo(OutputPropertyDeriver.java:183)
        at com.starrocks.sql.optimizer.OutputPropertyDeriver.visitPhysicalJoin(OutputPropertyDeriver.java:259)
        at com.starrocks.sql.optimizer.OutputPropertyDeriver.visitPhysicalHashJoin(OutputPropertyDeriver.java:199)
        at com.starrocks.sql.optimizer.OutputPropertyDeriver.visitPhysicalHashJoin(OutputPropertyDeriver.java:76)
        at com.starrocks.sql.optimizer.operator.physical.PhysicalHashJoinOperator.accept(PhysicalHashJoinOperator.java:41)
        at com.starrocks.sql.optimizer.OutputPropertyDeriver.getOutputProperty(OutputPropertyDeriver.java:95)
        at com.starrocks.sql.optimizer.task.EnforceAndCostTask.execute(EnforceAndCostTask.java:206)
        at com.starrocks.sql.optimizer.task.SeriallyTaskScheduler.executeTasks(SeriallyTaskScheduler.java:69)
        at com.starrocks.sql.optimizer.Optimizer.memoOptimize(Optimizer.java:571)
        at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:188)
        at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:126)
        at com.starrocks.sql.StatementPlanner.createQueryPlanWithReTry(StatementPlanner.java:203)
        at com.starrocks.sql.StatementPlanner.planQuery(StatementPlanner.java:119)
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:88)
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:57)
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:436)
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:362)
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:476)
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:742)
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69)
        at com.starrocks.mysql.nio.ReadListener$Lambda$737/1304093818.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

   Locked ownable synchronizers:
        - <0x00000004efd52070> (a java.util.concurrent.ThreadPoolExecutor$Worker)
  1. In version 3.2, joins with a non-equi ON condition return wrong results

  1. The tablet write operation update metadata take a long time

The tablet write operation update metadata take a long time
  1. Spark load causes an FE deadlock

jstack shows the following stacks:

"starrocks-mysql-nio-pool-76":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000030a0b4430> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at com.starrocks.common.util.QueryableReentrantReadWriteLock.sharedLock(QueryableReentrantReadWriteLock.java:43)
        at com.starrocks.catalog.Database.readLock(Database.java:182)
        at com.starrocks.load.loadv2.BulkLoadJob.checkAndSetDataSourceInfo(BulkLoadJob.java:171)
        at com.starrocks.load.loadv2.BulkLoadJob.fromLoadStmt(BulkLoadJob.java:162)
        at com.starrocks.load.loadv2.LoadMgr.createLoadJobFromStmt(LoadMgr.java:144)
        at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.lambda$visitLoadStatement$16(DDLStmtExecutor.java:370)
        at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor$Lambda$1580/1903605390.apply(Unknown Source)
        at com.starrocks.common.ErrorReport.wrapWithRuntimeException(ErrorReport.java:112)
        at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitLoadStatement(DDLStmtExecutor.java:360)
        at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitLoadStatement(DDLStmtExecutor.java:163)
        at com.starrocks.sql.ast.LoadStmt.accept(LoadStmt.java:346)
        at com.starrocks.qe.DDLStmtExecutor.execute(DDLStmtExecutor.java:149)
        at com.starrocks.qe.StmtExecutor.handleDdlStmt(StmtExecutor.java:1420)
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:595)
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:374)
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:480)
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:756)
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69)
        at com.starrocks.mysql.nio.ReadListener$Lambda$1089/609879095.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
        
 "thrift-server-pool-2863":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000030475d590> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at com.starrocks.load.loadv2.LoadMgr.readLock(LoadMgr.java:698)
        at com.starrocks.load.loadv2.LoadMgr.getLoadJob(LoadMgr.java:635)
        at com.starrocks.leader.LeaderImpl.finishRealtimePush(LeaderImpl.java:546)
        at com.starrocks.leader.LeaderImpl.finishTask(LeaderImpl.java:275)
        at com.starrocks.service.FrontendServiceImpl.finishTask(FrontendServiceImpl.java:1082)
        at com.starrocks.thrift.FrontendService$Processor$finishTask.getResult(FrontendService.java:3621)
        at com.starrocks.thrift.FrontendService$Processor$finishTask.getResult(FrontendService.java:3601)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)       
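
When a jstack dump is long, it can help to list which lock each parked thread is waiting for. The following is a minimal sketch, assuming the dump has the same layout as the two stacks above; the names `parked_threads` and `SAMPLE` are made up for this illustration. It only reports the waiters. It does not show which thread holds each lock.

```python
import re

# Thread header, e.g.  "thrift-server-pool-2863":
THREAD_RE = re.compile(r'^\s*"(?P<name>[^"]+)"')
# Wait line, e.g.  - parking to wait for  <0x...> (a some.Lock$Class)
WAIT_RE = re.compile(
    r'parking to wait for\s+<(?P<addr>0x[0-9a-fA-F]+)>\s+\(a (?P<cls>[^)]+)\)'
)


def parked_threads(jstack_text):
    """Return {thread name: (lock address, lock class)} for parked threads."""
    waits = {}
    current = None
    for line in jstack_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = header.group("name")
            continue
        wait = WAIT_RE.search(line)
        if wait and current is not None:
            waits[current] = (wait.group("addr"), wait.group("cls"))
    return waits


# Shortened copy of the two stacks shown above (hypothetical input variable).
SAMPLE = '''"starrocks-mysql-nio-pool-76":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000030a0b4430> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
 "thrift-server-pool-2863":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000030475d590> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
'''

for name, (addr, cls) in parked_threads(SAMPLE).items():
    print(f"{name} waits for {addr} ({cls})")
```

Here the two threads wait on different lock objects (a FairSync and a NonfairSync), which matches the two stacks: one is blocked in Database.readLock and the other in LoadMgr.readLock.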
  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.19

    • 3.0.0 ~ 3.0.9

    • 3.1.0 ~ 3.1.9

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.5.20+

    • 3.0.10+

    • 3.1.10+

    • 3.2.7+

  • Root cause:

    • Easily triggered when many Spark load jobs are running
  • Workaround:

  1. A left join is incorrectly converted to an inner join, producing wrong query results

  1. Low-cardinality rewrite error

2024-02-20 09:32:18,029 WARN (starrocks-mysql-nio-pool-331|28388) [StmtExecutor.execute():551] execute Exception, sql SELECT CASE WHEN assignee_id = '' THEN '' ELSE SUBSTR(MD5(assignee_id), 1, 8) END AS sample_value FROM data_center.mart_board_issues_basic LIMIT 10
java.lang.IllegalStateException: null
        at com.google.common.base.Preconditions.checkState(Preconditions.java:496) ~[spark-dpp-1.0.0.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.DictMappingRewriter.rewriteAsDictMapping(DictMappingRewriter.java:67) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.DictMappingRewriter.rewrite(DictMappingRewriter.java:46) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.rewriteOneScalarOperatorForProjection(AddDecodeNodeForDictStringRule.java:560) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.rewriteProjectOperator(AddDecodeNodeForDictStringRule.java:489) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitProjectionAfter(AddDecodeNodeForDictStringRule.java:262) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalOlapScan(AddDecodeNodeForDictStringRule.java:458) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalOlapScan(AddDecodeNodeForDictStringRule.java:171) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalOlapScanOperator.accept(PhysicalOlapScanOperator.java:138) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalDistribution(AddDecodeNodeForDictStringRule.java:840) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalDistribution(AddDecodeNodeForDictStringRule.java:171) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalDistributionOperator.accept(PhysicalDistributionOperator.java:44) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalLimit(AddDecodeNodeForDictStringRule.java:308) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule$DecodeVisitor.visitPhysicalLimit(AddDecodeNodeForDictStringRule.java:171) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.operator.physical.PhysicalLimitOperator.accept(PhysicalLimitOperator.java:33) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.rule.tree.AddDecodeNodeForDictStringRule.rewrite(AddDecodeNodeForDictStringRule.java:930) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.physicalRuleRewrite(Optimizer.java:484) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:174) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:95) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.createQueryPlanWithReTry(StatementPlanner.java:181) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.planQuery(StatementPlanner.java:103) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:73) ~[starrocks-fe.jar:?]
        at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:44) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:402) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:327) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:444) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:711) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:55) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.19

    • 3.0.0 ~ 3.0.9

    • 3.1.0 ~ 3.1.9

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.5.20+

    • 3.0.10+

    • 3.1.10+

    • 3.2.7+

  • Root cause:

  • Workaround:

    • set global cbo_enable_low_cardinality_optimize = false;
  1. Incorrect nullable handling crashes queries on external tables

*** Aborted at 1715391277 (unix time) try "date -d @1715391277" if you are using GNU date ***
PC: @ 0x7f09e938d61a __memcpy_ssse3_back
*** SIGSEGV (@0x7effcbcffffa) received by PID 36590 (TID 0x7f0864581700) from PID 18446744072833990650; stack trace: ***
@ 0x5ae06d2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f09ea87bab7 os::Linux::chained_handler()
@ 0x7f09ea883055 JVM_handle_linux_signal
@ 0x7f09ea878383 signalHandler()
@ 0x7f09e9d22630 (unknown)
@ 0x7f09e938d61a __memcpy_ssse3_back
@ 0x4a3f30b starrocks::MysqlRowBuffer::_push_string_normal()
@ 0x5364e14 starrocks::MysqlResultWriter::process_chunk()
@ 0x508ea64 starrocks::pipeline::ResultSinkOperator::push_chunk()
@ 0x2df4b93 starrocks::pipeline::PipelineDriver::process()
@ 0x50a20f3 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4a69e72 starrocks::ThreadPool::dispatch_thread()
@ 0x4a6496a starrocks::Thread::supervise_thread()
@ 0x7f09e9d1aea5 start_thread
@ 0x7f09e93359fd __clone
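
The first line of a BE crash log carries the crash time as a Unix timestamp, and the log itself suggests `date -d @...` to decode it. On machines without GNU date, the same conversion can be sketched in Python. The helper name `crash_time_utc` is made up for this illustration, and the result is in UTC, not the server's local time zone.

```python
import re
from datetime import datetime, timezone

# Matches the header line that glog prints before the stack trace.
ABORT_RE = re.compile(r"Aborted at (\d+) \(unix time\)")


def crash_time_utc(log_line):
    """Return the crash time as an aware UTC datetime, or None if absent."""
    match = ABORT_RE.search(log_line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


line = ('*** Aborted at 1715391277 (unix time) try "date -d @1715391277" '
        'if you are using GNU date ***')
print(crash_time_utc(line).isoformat())  # 2024-05-11T01:34:37+00:00
```

The decoded time makes it easier to line the crash up with be.INFO and fe.log entries from the same moment.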
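  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.7

    • 3.0.0 ~ 3.0.2

  • Fixed versions:

    • 2.5.8+

    • 3.0.3+

  • Root cause:

    • Some columns were declared NOT NULL when the external table was created, but the actual data contains NULL values
  • Workaround:

    • Declare the columns as nullable when creating the external table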
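
Every entry in this post lists affected version ranges, so checking a cluster against them can be done mechanically. Below is a minimal sketch using the ranges of the external-table crash entry above. The names `parse_version`, `AFFECTED`, and `is_affected` are made up for this illustration, and the code assumes plain `major.minor.patch` version strings without suffixes.

```python
def parse_version(text):
    """Turn '2.5.7' into (2, 5, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


# Ranges copied from the "Affected versions" list of the entry above.
AFFECTED = [("2.5.0", "2.5.7"), ("3.0.0", "3.0.2")]


def is_affected(version, ranges=AFFECTED):
    """True if version falls inside any inclusive (low, high) range."""
    v = parse_version(version)
    return any(parse_version(lo) <= v <= parse_version(hi) for lo, hi in ranges)


for candidate in ("2.5.7", "2.5.8", "3.0.1", "3.1.0"):
    status = "affected" if is_affected(candidate) else "not listed as affected"
    print(candidate, status)
```

Comparing tuples of integers avoids the string-comparison trap where "2.5.10" sorts before "2.5.9".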
  1. Storage Page Cache uses more memory than its configured limit

  1. With many tablets, frequent queries on information_schema.tables slow down queries and loads

[
    {
        "lockState": "readLocked",
        "slowReadLockCount": 3,
        "dumpThreads": "lockHoldTime: 3465 ms;dump thread: thrift-server-pool-16449, id: 17420
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:27)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:37)
    com.starrocks.catalog.LocalTablet.getDataSize(LocalTablet.java:453)
    com.starrocks.catalog.MaterializedIndex.getDataSize(MaterializedIndex.java:218)
    com.starrocks.catalog.Partition.getDataSize(Partition.java:320)
    com.starrocks.catalog.OlapTable.getDataSize(OlapTable.java:1680)
    com.starrocks.service.InformationSchemaDataSource.genNormalTableInfo(InformationSchemaDataSource.java:367)
    com.starrocks.service.InformationSchemaDataSource.generateTablesInfoResponse(InformationSchemaDataSource.java:324)
    com.starrocks.service.FrontendServiceImpl.getTablesInfo(FrontendServiceImpl.java:1492)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2301)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2281)
    org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
;lockHoldTime: 3460 ms;dump thread: thrift-server-pool-16431, id: 17399
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:27)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:37)
    com.starrocks.catalog.LocalTablet.getDataSize(LocalTablet.java:453)
    com.starrocks.catalog.MaterializedIndex.getDataSize(MaterializedIndex.java:218)
    com.starrocks.service.InformationSchemaDataSource.genNormalTableInfo(InformationSchemaDataSource.java:356)
    com.starrocks.service.InformationSchemaDataSource.generateTablesInfoResponse(InformationSchemaDataSource.java:324)
    com.starrocks.service.FrontendServiceImpl.getTablesInfo(FrontendServiceImpl.java:1492)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2301)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2281)
    org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
;lockHoldTime: 3485 ms;dump thread: thrift-server-pool-16437, id: 17405
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:27)
    com.starrocks.common.CloseableLock.lock(CloseableLock.java:37)
    com.starrocks.catalog.LocalTablet.getDataSize(LocalTablet.java:453)
    com.starrocks.catalog.MaterializedIndex.getDataSize(MaterializedIndex.java:218)
    com.starrocks.service.InformationSchemaDataSource.genNormalTableInfo(InformationSchemaDataSource.java:356)
    com.starrocks.service.InformationSchemaDataSource.generateTablesInfoResponse(InformationSchemaDataSource.java:324)
    com.starrocks.service.FrontendServiceImpl.getTablesInfo(FrontendServiceImpl.java:1492)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2301)
    com.starrocks.thrift.FrontendService$Processor$getTablesInfo.getResult(FrontendService.java:2281)
    org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:311)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
;",
        "lockDbName": "edw_brain_kfc_db",
        "lockWaiters": [

        ]
    }
]
  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.20

    • 3.0.0 ~ 3.0.9

    • 3.1.0 ~ 3.1.11

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.5.21+

    • 3.0.10+

    • 3.1.12+

    • 3.2.7+

  • Root cause:

  • Resolution:

    • The current fix is only an optimization; it does not fully resolve the problem.
  1. Primary Key table compaction fails with an error

writer add_columns error, tablet=10087, err=Internal error: column 0 is sort key but not find while init segment writer
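  1. Primary Key tables produce very large l1 files, causing heavy I/O

  1. On Primary Key tables, queries that use the ShortKeyIndex return wrong results after a schema change

  1. An oversized Persistent Index consumes a large amount of disk I/O

The usual symptom is that the l1 files are large. The trigger is a Primary Key table with a large number of deletes.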

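  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.21

    • 3.0.0 ~ 3.0.9

    • 3.1.0 ~ 3.1.11

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.5.22+

    • 3.0.10+

    • 3.1.12+

    • 3.2.7+

  • Root cause:

  • Workaround:

    • Restarting the BE mitigates the problem
  1. Repeated reads and writes of Persistent Index l0 files on Primary Key tables cause high disk I/O

  1. roaring2range consumes a lot of CPU, hurting concurrent performance on Primary Key tables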

                         |          |                                                                  --51.02%--starrocks::vectorized::SegmentIterator::_init
                          |          |                                                                            |          
                          |          |                                                                            |--48.93%--starrocks::vectorized::SegmentIterator::_apply_del_vector
                          |          |                                                                            |          |          
                          |          |                                                                            |          |--47.45%--starrocks::vectorized::roaring2range
                          |          |                                                                            |          |          |          
                          |          |                                                                            |          |          |--20.68%--roaring_read_uint32_iterator
                          |          |                                                                            |          |          |          
                          |          |                                                                            |          |          |--4.62%--starrocks::vectorized::SparseRange::add
                          |          |                                                                            |          |          |          
                          |          |                                                                            |          |           --1.68%--std::vector<starrocks::vectorized::Range, std::allocator<starrocks::vectorized::Range> >::_M_realloc_insert<starrocks::vectorized::Range const&>
                          |          |                                                                            |          |          
                          |          |                                                                            |           --1.36%--starrocks::vectorized::SparseRange::add


This PR mitigates the problem but does not fully resolve it.

  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.3.0 ~ latest

    • 2.4.0 ~ latest

    • 2.5.0 ~ latest

    • 3.0.0 ~ latest

    • 3.1.0 ~ 3.1.11

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.3 not fixed

    • 2.4 not fixed

    • 2.5 not fixed

    • 3.0 not fixed

    • 3.1.12+

    • 3.2.7+

  • Root cause:

  • Workaround:

  1. Grouping sets crash

*** Aborted at 1705967445 (unix time) try "date -d @1705967445" if you are using GNU date ***
PC: @ 0x2d57320 starrocks::vectorized::FixedLengthColumnBase<>::append_selective()
*** SIGSEGV (@0x1000) received by PID 14905 (TID 0x7ffb0cafb700) from PID 4096; stack trace: ***
@ 0x5b97b22 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7ffbd8c54630 (unknown)
@ 0x2d57320 starrocks::vectorized::FixedLengthColumnBase<>::append_selective()
@ 0x50cb458 starrocks::vectorized::NullableColumn::append_selective()
@ 0x50ae6ca starrocks::vectorized::Chunk::append_selective()
@ 0x324dfbe starrocks::pipeline::LocalExchangeSourceOperator::_pull_shuffle_chunk()
@ 0x324e897 starrocks::pipeline::LocalExchangeSourceOperator::pull_chunk()
@ 0x2d906c0 starrocks::pipeline::PipelineDriver::process()
@ 0x51add6a starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x4b968f2 starrocks::ThreadPool::dispatch_thread()
@ 0x4b9138a starrocks::Thread::supervise_thread()
@ 0x7ffbd8c4cea5 start_thread
@ 0x7ffbd8267b0d __clone
@ 0x0 (unknown)
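  1. ThreadResourceMgr lock contention keeps BE CPU utilization low and limits concurrent performance

Symptoms:

  • Low CPU utilization

  • Heavy lock contention

  • Poor concurrent performance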

#0  0x00007f09c61f675d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f09c61efa79 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000001ea9242 in __gthread_mutex_lock (__mutex=0x7f09c373d688) at /usr/include/c++/10.3.0/x86_64-pc-linux-gnu/bits/gthr-default.h:749
#3  std::mutex::lock (this=0x7f09c373d688) at /usr/include/c++/10.3.0/bits/std_mutex.h:100
#4  std::unique_lock<std::mutex>::lock (this=0x7ef689befd70) at /usr/include/c++/10.3.0/bits/unique_lock.h:138
#5  std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7ef689befd70) at /usr/include/c++/10.3.0/bits/unique_lock.h:68
#6  starrocks::ThreadResourceMgr::unregister_pool (this=0x7f09c373d680, pool=0x7f05bcd373a0) at /root/starrocks/be/src/runtime/thread_resource_mgr.cpp:96
#7  0x0000000001f1c07e in starrocks::RuntimeState::~RuntimeState (this=0x7efa59c5ac10, __in_chrg=<optimized out>) at /root/starrocks/be/src/runtime/exec_env.h:141
#8  0x0000000001ead272 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7efa59c5ac00) at /usr/include/c++/10.3.0/ext/atomicity.h:70
#9  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7efa59c5ac00) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:151
#10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:733
#11 std::__shared_ptr<starrocks::RuntimeState, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:1183
#12 std::shared_ptr<starrocks::RuntimeState>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr.h:121
#13 starrocks::FragmentExecState::~FragmentExecState (this=<optimized out>, __in_chrg=<optimized out>) at /root/starrocks/be/src/runtime/fragment_mgr.cpp:170
#14 0x0000000001eb67eb in std::_Sp_counted_ptr<starrocks::FragmentExecState*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:379
#15 0x000000000192690a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7ef1aeb59840) at /usr/include/c++/10.3.0/ext/atomicity.h:70
#16 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7ef1aeb59840) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:151
#17 0x0000000001eae775 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7efa585fa010, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:245
#18 std::__shared_ptr<starrocks::FragmentExecState, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7efa585fa008, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:1183
#19 std::shared_ptr<starrocks::FragmentExecState>::~shared_ptr (this=0x7efa585fa008, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr.h:121
#20 ~<lambda> (this=0x7efa585fa000, __in_chrg=<optimized out>) at /root/starrocks/be/src/runtime/fragment_mgr.cpp:438
#21 std::_Function_base::_Base_manager<starrocks::FragmentMgr::exec_plan_fragment(const starrocks::TExecPlanFragmentParams&, const StartSuccCallback&, const FinishCallback&)::<lambda()> >::_M_destroy (__victim=...) at /usr/include/c++/10.3.0/bits/std_function.h:176
#22 std::_Function_base::_Base_manager<starrocks::FragmentMgr::exec_plan_fragment(const starrocks::TExecPlanFragmentParams&, const StartSuccCallback&, const FinishCallback&)::<lambda()> >::_M_manager (__op=<optimized out>, __source=..., __dest=...) at /usr/include/c++/10.3.0/bits/std_function.h:200
#23 std::_Function_handler<void(), starrocks::FragmentMgr::exec_plan_fragment(const starrocks::TExecPlanFragmentParams&, const StartSuccCallback&, const FinishCallback&)::<lambda()> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation) (__dest=..., __source=..., __op=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:283
#24 0x0000000001ff7692 in std::_Function_base::~_Function_base (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:245
#25 std::function<void ()>::~function() (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/std_function.h:303
#26 starrocks::FunctionRunnable::~FunctionRunnable (this=<optimized out>, __in_chrg=<optimized out>) at /root/starrocks/be/src/util/threadpool.cpp:41
#27 0x0000000001ff7192 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7efa585fa040) at /root/starrocks/be/src/util/threadpool.cpp:471
#28 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7efa585fa040) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:151
#29 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:733
#30 std::__shared_ptr<starrocks::Runnable, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:1183
#31 std::__shared_ptr<starrocks::Runnable, (__gnu_cxx::_Lock_policy)2>::reset (this=<synthetic pointer>) at /usr/include/c++/10.3.0/bits/shared_ptr_base.h:1301
#32 starrocks::ThreadPool::dispatch_thread (this=0x7f09c4003c00) at /root/starrocks/be/src/util/threadpool.cpp:522
#33 0x0000000001ff298a in std::function<void ()>::operator()() const (this=0x7efe056fd8d8) at /usr/include/c++/10.3.0/bits/std_function.h:248
#34 starrocks::Thread::supervise_thread (arg=0x7efe056fd8c0) at /root/starrocks/be/src/util/thread.cpp:327
#35 0x00007f09c61ed17a in start_thread () from /lib64/libpthread.so.0
#36 0x00007f09c578edf3 in clone () from /lib64/libc.so.6
  1. Duplicate columns in the sort key of a Primary Key table crash the BE

For example:

CREATE TABLE orders2 (
    order_id bigint NOT NULL,
    dt date NOT NULL,
    merchant_id int NOT NULL,
    user_id int NOT NULL,
    good_id int NOT NULL,
    good_name string NOT NULL,
    price int NOT NULL,
    cnt int NOT NULL,
    revenue int NOT NULL,
    state tinyint NOT NULL
)
PRIMARY KEY (order_id,dt,merchant_id)
PARTITION BY date_trunc('day', dt)
DISTRIBUTED BY HASH (merchant_id)
ORDER BY (dt,merchant_id,dt) -- dt appears twice
PROPERTIES (
    "enable_persistent_index" = "true"
);
*** Aborted at 1710928318 (unix time) try "date -d @1710928318" if you are using GNU date ***
PC: @          0x58f1556 _ZZN9starrocksL17prepare_ops_datasERKNS_6SchemaERKSt6vectorIjSaIjEERKNS_5ChunkEPS3_IPFvPKviPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEESaISL_EEPS3_ISC_SaISC_EEENUlSC_iSJ_E6_4_FUNESC_iSJ_
*** SIGSEGV (@0x8) received by PID 36293 (TID 0x701034335640) from PID 8; stack trace: ***
    @          0x7cd4b2a google::(anonymous namespace)::FailureSignalHandler()
    @     0x70112d442520 (unknown)
    @          0x58f1556 _ZZN9starrocksL17prepare_ops_datasERKNS_6SchemaERKSt6vectorIjSaIjEERKNS_5ChunkEPS3_IPFvPKviPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEESaISL_EEPS3_ISC_SaISC_EEENUlSC_iSJ_E6_4_FUNESC_iSJ_
    @          0x58f365d starrocks::PrimaryKeyEncoder::encode_sort_key()
    @          0x5acf518 starrocks::MergeEntry<>::next()
    @          0x5ad2999 starrocks::RowsetMergerImpl<>::_do_merge_horizontally()
    @          0x5ad44fc starrocks::RowsetMergerImpl<>::_do_merge_vertically()
    @          0x5ad651c starrocks::RowsetMergerImpl<>::do_merge()
    @          0x5ac894f starrocks::compaction_merge_rowsets()
    @          0x3ebfcfc starrocks::TabletUpdates::_do_compaction()
    @          0x3ec13a6 starrocks::TabletUpdates::compaction()
    @          0x3c95451 starrocks::StorageEngine::_perform_update_compaction()
    @          0x3cb4607 starrocks::StorageEngine::_update_compaction_thread_callback()
    @          0xa34af34 execute_native_thread_routine
    @     0x70112d494ac3 (unknown)
    @     0x70112d526850 (unknown)
    @                0x0 (unknown)
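  • GitHub Issue:

  • GitHub Fix PR:

  • Jira

  • Affected versions:

    • 2.5.0 ~ 2.5.20

    • 3.0.0 ~ latest

    • 3.1.0 ~ 3.1.10

    • 3.2.0 ~ 3.2.6

  • Fixed versions:

    • 2.5.21+

    • 3.0 not fixed

    • 3.1.11+

    • 3.2.7+

  • Root cause:

  • Resolution:

    • Remove the affected table with DROP TABLE ... FORCE. If the BE fails to start, delete the affected tablets with meta_tool and then upgrade. Upgrading alone does not fix the problem: the affected tablets must be removed before the upgrade.
  1. Cross-cluster data synchronization crashes the BE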
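
On affected versions, one way to avoid creating such a table in the first place is to check the ORDER BY column list for duplicates before running the DDL. The sketch below shows the idea; `duplicate_columns` is a made-up helper, and treating column names as case-insensitive is an assumption of this example.

```python
from collections import Counter


def duplicate_columns(columns):
    """Return column names that appear more than once, in first-seen order."""
    # Assumption: names are compared case-insensitively after trimming spaces.
    normalized = [name.strip().lower() for name in columns]
    counts = Counter(normalized)
    duplicates = []
    for name in normalized:
        if counts[name] > 1 and name not in duplicates:
            duplicates.append(name)
    return duplicates


# Sort key taken from the CREATE TABLE statement above.
sort_key = ["dt", "merchant_id", "dt"]
print(duplicate_columns(sort_key))  # ['dt']
```

An empty result means the sort key has no repeated column. Any non-empty result should be fixed in the DDL before the table is created.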

*** Aborted at 1716360239 (unix time) try "date -d @1716360239" if you are using GNU date ***
PC: @          0x5072261 starrocks::ReplicationUtils::calc_column_unique_id_map<>()
*** SIGSEGV (@0x18) received by PID 186073 (TID 0x2af0f7202700) from PID 24; stack trace: ***
    @          0x67749a2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2aeeec3c2630 (unknown)
    @          0x5072261 starrocks::ReplicationUtils::calc_column_unique_id_map<>()
    @          0x506d87a starrocks::ReplicationTxnManager::replicate_remote_snapshot()
    @          0x506e168 starrocks::ReplicationTxnManager::replicate_snapshot()
    @          0x341f4d0 starrocks::run_replicate_snapshot_task()
    @          0x2e79d7c starrocks::ThreadPool::dispatch_thread()
    @          0x2e739fa starrocks::Thread::supervise_thread()
    @     0x2aeeec3baea5 start_thread
    @     0x2aeeecff596d __clone
    @                0x0 (unknown)
  1. str_to_jodatime crash

select str_to_jodatime('2014-12-21 12:34:56', 'yyyy-MM-dd HH:mm:ss');
*** Aborted at 1703578849 (unix time) try "date -d @1703578849" if you are using GNU date ***
PC: @          0x4aac80a _ZNSt17_Function_handlerIFbvEZN9starrocks4joda10JodaFormat7prepareESt17basic_string_viewIcSt11char_traitsIcEEEUlvE11_E9_M_invokeERKSt9_Any_data
*** SIGSEGV (@0x0) received by PID 32502 (TID 0x7f0104fda700) from PID 0; stack trace: ***
    @          0x5e5a102 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f020d5bb7ab os::Linux::chained_handler()
    @     0x7f020d5c028c JVM_handle_linux_signal
    @     0x7f020d5b3148 signalHandler()
    @     0x7f020ca916d0 (unknown)
    @          0x4aac80a _ZNSt17_Function_handlerIFbvEZN9starrocks4joda10JodaFormat7prepareESt17basic_string_viewIcSt11char_traitsIcEEEUlvE11_E9_M_invokeERKSt9_Any_data
    @          0x4aaf57a starrocks::joda::JodaFormat::parse()
    @          0x55096d2 starrocks::TimeFunctions::parse_jodatime()
    @          0x408dfb4 starrocks::VectorizedFunctionCallExpr::evaluate_checked()
    @          0x37d5c93 starrocks::ExprContext::evaluate()
    @          0x37d5fdf starrocks::ExprContext::evaluate()
    @          0x2b0e864 starrocks::pipeline::ProjectOperator::push_chunk()
    @          0x281ea8c starrocks::pipeline::PipelineDriver::process()
    @          0x53ced3e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x4cc33d2 starrocks::ThreadPool::dispatch_thread()
    @          0x4cbde6a starrocks::Thread::supervise_thread()
    @     0x7f020ca89e25 start_thread
    @     0x7f020be8cbad __clone
    @                0x0 (unknown)