BE crash and a few FE issues on version 2.3.10, asking for advice

【Details】The cluster was originally on 1.19 and was later upgraded to 2.3.10. Reviewing the logs, we found three issues:

  1. BE crashes with std::length_error
  2. FE reports "query failed: no global dict"
  3. FE reports "WARN cannot find task. type: PUBLISH_VERSION, backendId: 10003, signature: 51951650"

Detailed logs for all three issues are attached. How should we resolve them? If issues 2 and 3 are left unresolved, what are the consequences?

For issue 1 we found a similar report: item 29 under 常见 Crash / BUG / 优化 查询 - StarRocks 用户问答 - StarRocks中文社区论坛 (mirrorship.cn), but the error message is not quite the same, and that bug was already patched in 2.3.10.

【Background】There is a large volume of Flink Stream Load ingestion, plus about 600 INSERT OVERWRITE writes every 5 minutes. We only use duplicate (detail), aggregate, and unique tables; primary key tables are not used. Before the upgrade to 2.3.10 there were roughly 20 unhealthy replicas and over 2,000 inconsistent ones; after the upgrade the numbers are about the same.

【Business impact】No feedback from the business side so far; we brought the BE back up promptly after each crash.
【StarRocks version】2.3.10
【Cluster size】3 FE (1 follower + 2 observers) + 5 BE
【Machine specs】3 FE with 16C/32G; 5 BE with 32C/64G, each BE mounting five 1 TB SSDs
【Contact】Community group 8-tempo
【Attachments】

  1. Crash log:
    terminate called recursively
    terminate called after throwing an instance of 'std::length_error'
    what(): basic_string::_M_create
    query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
    *** Aborted at 1682493788 (unix time) try "date -d @1682493788" if you are using GNU date ***
    PC: @ 0x7fdd04c7a3d7 __GI_raise
    *** SIGABRT (@0x4e96) received by PID 20118 (TID 0x7fdbdfd89700) from PID 20118; stack trace: ***
    @ 0x41b9c62 google::(anonymous namespace)::FailureSignalHandler()
    @ 0x7fdd05bb3630 (unknown)
    @ 0x7fdd04c7a3d7 __GI_raise
    @ 0x7fdd04c7bac8 __GI_abort
    @ 0x1991adb _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @ 0x5ccd4c6 __cxxabiv1::__terminate()
    @ 0x5d71599 __cxa_call_terminate
    @ 0x5cccee1 __gxx_personality_v0
    @ 0x5d781ee _Unwind_RaiseException_Phase2
    @ 0x5d78ce6 _Unwind_Resume
    @ 0x18a518c _ZN4brpc6policy17ProcessRpcRequestEPNS_16InputMessageBaseE.cold
    @ 0x42e2427 brpc::ProcessInputMessage()
    @ 0x42e32d3 brpc::InputMessenger::OnNewMessages()
    @ 0x4389f9e brpc::Socket::ProcessEvent()
    @ 0x4297f2f bthread::TaskGroup::task_runner()
    @ 0x4420711 bthread_make_fcontext

  2. query failed: no global dict

2023-04-26 09:18:41,292 WARN (ForkJoinPool.commonPool-worker-1|15745) [Coordinator.getNext():916] get next fail, need cancel. status errorCode CANCELLED Cancelled BufferControlBlock::cancel, query id: 47834049-e3d0-11ed-8195-fa163e84adb1
2023-04-26 09:18:41,292 WARN (ForkJoinPool.commonPool-worker-1|15745) [Coordinator.getNext():937] query failed: no global dict
2023-04-26 09:18:41,292 WARN (ForkJoinPool.commonPool-worker-1|15745) [StatisticExecutor.executeStmt():391] com.starrocks.common.UserException: no global dict
2023-04-26 09:18:41,292 WARN (thrift-server-pool-6851|15644) [Coordinator.updateFragmentExecStatus():1682] one instance report fail errorCode CANCELLED Cancelled SenderQueue::get_chunk, query_id=47834049-e3d0-11ed-8195-fa163e84adb1 instance_id=47834049-e3d0-11ed-8195-fa163e84adb7

  3. cannot find task. type: PUBLISH_VERSION

FE and BE timeline

// on the FE
2023-04-26 14:08:57,166 INFO (thrift-server-pool-8671|17974) [DatabaseTransactionMgr.beginTransaction():301] begin transaction: txn_id: 51699033 with label 68c63cb4-85aa-4b9b-8179-7d288bd84c58 from coordinator BE: xxxx, listner id: -1
2023-04-26 14:08:57,166 INFO (thrift-server-pool-8666|17969) [FrontendServiceImpl.streamLoadPut():1114] receive stream load put request. db:ods, tbl: table, txn_id: 51699033, load id: 4548c857-a405-6beb-1424-330753d552a3, backend: xxxx
2023-04-26 14:08:57,170 INFO (thrift-server-pool-8666|17969) [StreamLoadPlanner.plan():222] load job id: TUniqueId(hi:4992460465679723499, lo:1451341086484419235) tx id 51699033 parallel 0 compress null

// on the BE

I0426 14:08:57.294407 26091 txn_manager.cpp:204] Commit txn successfully. tablet: 8849141, txn_id: 51699033, rowsetid: 0200000009832db547484b3519a758f867dec7c83415f4a6 #segment:1 #delfile:0
I0426 14:08:57.294926 26735 txn_manager.cpp:204] Commit txn successfully. tablet: 8849149, txn_id: 51699033, rowsetid: 0200000009832db647484b3519a758f867dec7c83415f4a6 #segment:1 #delfile:0

I0426 14:08:59.614212 21232 task_worker_pool.cpp:206] Submit task success. type=PUBLISH_VERSION, signature=51699033, task_count_in_queue=1
I0426 14:08:59.614226 26533 task_worker_pool.cpp:873] get publish version task, signature:51699033 txn_id: 51699033 priority queue size: 1
I0426 14:08:59.614496 26548 engine_publish_version_task.cpp:60] Publish txn success tablet:8849117 version:5072 partition:8849112 txn_id: 51699033 rowset:0200000009832db147484b3519a758f867dec7c83415f4a6
I0426 14:08:59.614691 26540 engine_publish_version_task.cpp:60] Publish txn success tablet:8849121 version:5072 partition:8849112 txn_id: 51699033 rowset:0200000009832db247484b3519a758f867dec7c83415f4a6

I0426 14:08:59.615902 26533 task_worker_pool.cpp:801] Publish version on partition. partition: 8849112, txn_id: 51699033, version: 5072
I0426 14:08:59.615912 26533 task_worker_pool.cpp:902] publish_version success. signature:51699033 txn_id: 51699033 related tablet num: 12 time: 1ms

// back on the FE
2023-04-26 14:08:59,609 INFO (thrift-server-pool-7859|17045) [FrontendServiceImpl.loadTxnCommit():943] receive txn commit request. db: ods, tbl: table, txn_id: 51699033, backend: xxxx
2023-04-26 14:08:59,612 INFO (PUBLISH_VERSION|31) [PublishVersionDaemon.publishVersion():86] send publish tasks for txn_id: 51699033
2023-04-26 14:08:59,612 INFO (thrift-server-pool-7859|17045) [DatabaseTransactionMgr.commitTransaction():575] transaction:[TransactionState. txn_id: 51699033, label: 68c63cb4-85aa-4b9b-8179-7d288bd84c58, db id: 10045, table id list: , callback id: -1, coordinator: BE: 10.67.2.69, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1682489337166, commit time: 1682489339609, finish time: -1, publish cost: -1682489339610ms, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@325408c8] successfully committed
2023-04-26 14:09:00,120 INFO (PUBLISH_VERSION|31) [DatabaseTransactionMgr.finishTransaction():927] finish transaction TransactionState. txn_id: 51699033, label: 68c63cb4-85aa-4b9b-8179-7d288bd84c58, db id: 10045, table id list: , callback id: -1, coordinator: BE: xxxx, transaction status: VISIBLE, error replicas num: 12, replica ids: 8849191,8849158,8849124,8849171,8849138, prepare time: 1682489337166, commit time: 1682489339609, finish time: 1682489340116, publish cost: 507ms, reason: attachment: com.starrocks.load.loadv2.ManualLoadTxnCommitAttachment@325408c8 successfully
2023-04-26 14:09:01,880 WARN (thrift-server-pool-221|585) [MasterImpl.finishTask():194] cannot find task. type: PUBLISH_VERSION, backendId: 10004, signature: 5169903

Here we can see error replicas num: 12.

The official releases here are updated too slowly,

this is still only at 2.3.8 while 2.3.12 is already out.

Not sure whether the bug you mentioned has been fixed.


The BE has still been crashing over the past few days; the errors are as follows:

tcmalloc: large alloc 1278091264 bytes == 0x5c97ea000 @ 0x5aebeff 0x5d7d59c 0x21b0f48 0x5ccda15 0x1b58b14 0x1b3d296 0x1d37811 0x5d47630
query_id:a12e5451-e9af-11ed-8d9b-fa163e3f958b, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1683117608 (unix time) try "date -d @1683117608" if you are using GNU date ***
PC: @ 0x2f762f0 starrocks::vectorized::ColumnViewer<>::ColumnViewer()
*** SIGSEGV (@0x5e32380) received by PID 16142 (TID 0x7f3ab9383700) from PID 98771840; stack trace: ***
@ 0x41b9c62 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f3ae20c5630 (unknown)
@ 0x2f762f0 starrocks::vectorized::ColumnViewer<>::ColumnViewer()
@ 0x341672a starrocks::vectorized::JsonFunctions::_json_string_unescaped()
@ 0x3417232 starrocks::vectorized::JsonFunctions::get_json_string()
@ 0x33ab228 starrocks::vectorized::VectorizedFunctionCallExpr::evaluate()
@ 0x2f21b3c starrocks::vectorized::VectorizedBinaryPredicate<>::evaluate()
@ 0x2e546bc starrocks::ExprContext::evaluate()
@ 0x1f95c77 starrocks::vectorized::ColumnExprPredicate::evaluate()
@ 0x1c3977a starrocks::vectorized::SegmentIterator::_filter_by_expr_predicates()
@ 0x1c3afc2 starrocks::vectorized::SegmentIterator::_do_get_next()
@ 0x1c3e7f1 starrocks::vectorized::SegmentIterator::do_get_next()
@ 0x1c989d2 starrocks::vectorized::ProjectionIterator::do_get_next()
@ 0x1cd24c4 starrocks::vectorized::UnionIterator::do_get_next()
@ 0x20d37ea starrocks::SegmentIteratorWrapper::do_get_next()
@ 0x1ccc97b starrocks::vectorized::TimedChunkIterator::do_get_next()
@ 0x1cc56ee starrocks::vectorized::TabletReader::do_get_next()
@ 0x2aee7a4 starrocks::vectorized::TabletScanner::get_chunk()
@ 0x279c475 starrocks::vectorized::OlapScanNode::_scanner_thread()
@ 0x20f7fe0 starrocks::PriorityThreadPool::work_thread()
@ 0x4154e47 thread_proxy
@ 0x7f3ae20bdea5 start_thread
@ 0x7f3ae12549fd __clone
@ 0x0 (unknown)

tcmalloc: large alloc 1189404672 bytes == 0x8fd8b0000 @  0x5aebeff 0x5d7d59c 0x21b0f48 0x5ccda15 0x1b58b14 0x1b3d296 0x1d37811 0x5d47630

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1683174613 (unix time) try "date -d @1683174613" if you are using GNU date ***
PC: @ 0x4588a70 google::protobuf::Message::SpaceUsedLong()
*** SIGSEGV (@0x0) received by PID 32190 (TID 0x7f26b4bd7700) from PID 0; stack trace: ***
@ 0x41b9c62 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f27b39a0630 (unknown)
@ 0x4588a70 google::protobuf::Message::SpaceUsedLong()
@ 0x20c9c3b std::_Sp_counted_deleter<>::_M_dispose()
@ 0x1b699da std::_Hashtable<>::clear()
@ 0x1b6760c starrocks::Tablet::revise_tablet_meta()
@ 0x34fd45e starrocks::EngineCloneTask::_clone_incremental_data()
@ 0x3504232 starrocks::EngineCloneTask::_finish_clone()
@ 0x35050d5 starrocks::EngineCloneTask::_do_clone()
@ 0x350600a starrocks::EngineCloneTask::execute()
@ 0x1b33c0e starrocks::StorageEngine::execute_task()
@ 0x262235a starrocks::TaskWorkerPool::_clone_worker_thread_callback()
@ 0x5d47630 execute_native_thread_routine
@ 0x7f27b3998ea5 start_thread
@ 0x7f27b2b2f9fd __clone
@ 0x0 (unknown)

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1682602502 (unix time) try "date -d @1682602502" if you are using GNU date ***
PC: @ 0x1c2adac starrocks::SegmentWriter::_init_column_meta()
*** SIGSEGV (@0x0) received by PID 29492 (TID 0x7f3571ad8700) from PID 0; stack trace: ***
@ 0x41b9c62 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f36c7d92630 (unknown)
@ 0x1c2adac starrocks::SegmentWriter::_init_column_meta()
@ 0x1c2ae62 starrocks::SegmentWriter::_init_column_meta()
@ 0x1c2caf1 starrocks::SegmentWriter::init()
@ 0x1c2d9c1 starrocks::SegmentWriter::init()
@ 0x20dac44 starrocks::HorizontalBetaRowsetWriter::_create_segment_writer()
@ 0x20dc3f3 starrocks::HorizontalBetaRowsetWriter::_flush_chunk()
@ 0x20dc642 starrocks::HorizontalBetaRowsetWriter::flush_chunk()
@ 0x1cc0742 starrocks::vectorized::MemTableRowsetWriterSink::flush_chunk()
@ 0x1cd5058 starrocks::vectorized::MemTable::flush()
@ 0x1d29b33 starrocks::FlushToken::_flush_memtable()
@ 0x1d2a8a0 starrocks::MemtableFlushTask::run()
@ 0x2262d0d starrocks::ThreadPool::dispatch_thread()
@ 0x225e51a starrocks::thread::supervise_thread()
@ 0x7f36c7d8aea5 start_thread
@ 0x7f36c6f219fd __clone
@ 0x0 (unknown)

query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1682936820 (unix time) try "date -d @1682936820" if you are using GNU date ***
PC: @ 0x1313963 (unknown)
*** SIGSEGV (@0x0) received by PID 31201 (TID 0x7f0bbb02c700) from PID 0; stack trace: ***
@ 0x41b9c62 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f0c8f61e630 (unknown)
@ 0x1313963 (unknown)

tcmalloc: large alloc 1190682624 bytes == 0x791fe0000 @ 0x5aebeff 0x5d7d59c 0x21b0f48 0x5ccda15 0x1b58b14 0x1b3d296 0x1d37811 0x5d47630
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1682925470 (unix time) try "date -d @1682925470" if you are using GNU date ***
PC: @ 0x30 (unknown)

For this one, use the query id to look up the SQL in fe.audit.log. If it reproduces stably, please provide a query dump (how to obtain a query_dump file).

For the other two issues, please confirm the following settings:
1. Is swap disabled?
2. Does cat /proc/sys/vm/overcommit_memory return 1?
3. Is the BE co-deployed with other services? If so, set mem_limit = total memory - other services' memory - 1g in be.conf.
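Checks 1 and 2 above can be scripted; a minimal sketch assuming standard Linux /proc paths (the warning messages are only suggestions, adapt them to your environment):

```shell
# Read the two host settings the reply above asks about.
swap_total=$(awk '/SwapTotal/ {print $2}' /proc/meminfo)   # in kB; 0 means swap is off
overcommit=$(cat /proc/sys/vm/overcommit_memory)           # expected value: 1
echo "SwapTotal(kB)=${swap_total}  overcommit_memory=${overcommit}"
if [ "${swap_total}" != "0" ]; then
  echo "swap is enabled -- consider 'swapoff -a' and removing swap entries from /etc/fstab"
fi
if [ "${overcommit}" != "1" ]; then
  echo "run: sysctl -w vm.overcommit_memory=1 (and persist it in /etc/sysctl.conf)"
fi
```

Running this once on each BE node is enough; both values are host-wide, not per-process.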

1. Is swap disabled? -> yes, it is off
2. Does cat /proc/sys/vm/overcommit_memory return 1? -> yes
3. Is the BE co-deployed with other services? -> no other large applications; there are only filebeat and some Alibaba Cloud native agents. We also monitor BE memory in Grafana (starrocks_be_memory_allocated_bytes): of the 64 GB, at most about 30 GB is used and the rest is page cache, and dmesg shows no OOM kills either.

As for fe.audit.log: our fork has query logging turned off, so the query was not recorded, but according to the business team it should be similar to the following SQL:

CREATE TABLE tabletest (
a varchar(65533) NULL COMMENT "",
b varchar(65533) NULL COMMENT "",
c datetime NULL COMMENT "",
d varchar(65533) NULL COMMENT "",
e varchar(65533) NULL COMMENT "",
f varchar(65533) NULL COMMENT "",
g varchar(65533) NULL COMMENT "",
h varchar(65533) NULL COMMENT "",
i varchar(65533) NULL COMMENT "",
j varchar(65533) NULL COMMENT "",
k varchar(65533) NULL COMMENT "",
l varchar(65533) NULL COMMENT "",
m varchar(65533) NULL COMMENT "",
n varchar(65533) NULL COMMENT "",
o varchar(65533) NULL COMMENT "",
date date NULL COMMENT "partition date",
p varchar(65533) NULL COMMENT "",
q varchar(65533) NULL COMMENT "",
r bitmap BITMAP_UNION NULL COMMENT "",
s hll HLL_UNION NULL COMMENT "",
t bigint(20) SUM NULL DEFAULT "0" COMMENT "",
INDEX index_pic_b (b) USING BITMAP COMMENT '',
INDEX index_pic_bb (i) USING BITMAP COMMENT '',
INDEX index_pic_bbb (l) USING BITMAP COMMENT ''
) ENGINE=OLAP
AGGREGATE KEY(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, date, p, q)
COMMENT ""
PARTITION BY RANGE(date)
(PARTITION p20230105 VALUES [("2023-01-05"), ("2023-01-06")))
DISTRIBUTED BY HASH(a, b, c) BUCKETS 10
PROPERTIES (
"replication_num" = "3",
"bloom_filter_columns" = "a, c, b",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-120",
"dynamic_partition.end" = "5",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10",
"dynamic_partition.replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false"
);

select hll_union_agg(s) uv
from tabletest
where a = 'Test_app10_0603'
and date >= '2023-01-05'
and date <= '2023-04-17'
and get_json_string(q, '$.529708baf92440da992fdb27ba426292') like 'Test_exp182_%'

select get_json_string(q, '$.529708baf92440da992fdb27ba426292') id, hll_union_agg(s) uv
from tabletest
where a = 'Test_app10_0603'
and date >= '2023-01-05'
and date <= '2023-04-17'
and get_json_string(q, '$.529708baf92440da992fdb27ba426292') = 'Test_exp182_0_1_0603'
group by get_json_string(q, '$.529708baf92440da992fdb27ba426292')

Has it crashed again recently? If it has, please enable core dumps and analyze one.
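A minimal sketch for enabling core dumps on a BE node before the next crash. The dump directory /data/corefiles is an assumption, and writing core_pattern requires root; the BE must be restarted from the shell where the limit was raised so it inherits the setting:

```shell
# Remove the core-size limit for the current shell; a BE started from this
# same shell will inherit it and write a core file on the next SIGSEGV/SIGABRT.
ulimit -c unlimited
ulimit -c            # prints "unlimited" if the change took effect

# Optional, as root: route cores to a known directory (%e = binary name, %p = pid).
# mkdir -p /data/corefiles
# echo '/data/corefiles/core.%e.%p' > /proc/sys/kernel/core_pattern
```

The resulting core file can then be opened with gdb against the starrocks_be binary to get a full backtrace.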