BE failed

node1.txt (1.1 MB)
node2.txt (1.2 MB)

【Details】
Loading data from Kafka into StarRocks.
【Background】None
【Business impact】
【StarRocks version】e.g. 2.1.3
【Cluster size】e.g. 3 FE (1 follower + 2 observers) + 5 BE (FE and BE deployed separately)

【Attachments】
The BEs crashed while data was being loaded into StarRocks from Kafka.

Routine Load job error information:
ReasonOfStateChanged: ErrorReason{errCode = 3, msg='failed to create task: tablet 675319 has few replicas: 1, quorum: 2, cluster: 82555269'}
ErrorLogUrls:
OtherMsg:
Node 1: last heartbeat: 2022-06-01 21:45:08
Node 2: last heartbeat: 2022-06-01 22:33:48
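
The error above reports 1 live replica against a quorum of 2. A minimal sketch of that arithmetic, assuming the quorum is a simple majority of the table's replication_num and that the table uses 3 replicas (neither is stated in this thread; the function names are illustrative, not StarRocks APIs):

```python
def write_quorum(replication_num: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_num // 2 + 1


def can_create_load_task(alive_replicas: int, replication_num: int) -> bool:
    """A load task needs at least a quorum of live replicas for the tablet."""
    return alive_replicas >= write_quorum(replication_num)


# Assumed replication_num = 3; the error message reports 1 live replica.
print(write_quorum(3))             # 2
print(can_create_load_task(1, 3))  # False
print(can_create_load_task(2, 3))  # True
```

Under these assumptions, any tablet that keeps two of its three replicas on the two crashed BEs falls below quorum, which would be consistent with the task being rejected.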

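Did all of the BEs crash? Could you provide the corresponding be.out logs?

Only these two nodes went down. There is no be.out file in the log directory.

I also tried restarting the crashed nodes; once they are up, other nodes crash at random.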

start time: Wed Jun 1 22:33:23 CST 2022
tcmalloc: large alloc 11725260726272 bytes == (nil) @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x174a0d8 0x20d30d8 0x1b440ff 0x2e544fd 0x2e43c6b 0x362c72c 0x362d4f8 0x1d786c9 0x1d74288 0x7fe804f7cea5
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
*** Aborted at 1654094030 (unix time) try "date -d @1654094030" if you are using GNU date ***
PC: @ 0x7fe8042b93d7 __GI_raise
*** SIGABRT (@0x6777) received by PID 26487 (TID 0x7fe7f57e4700) from PID 26487; stack trace: ***
@ 0x3503022 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fe804f84630 (unknown)
@ 0x7fe8042b93d7 __GI_raise
@ 0x7fe8042baac8 __GI_abort
@ 0x15de049 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
@ 0x4f66726 __cxxabiv1::__terminate()
@ 0x4f66791 std::terminate()
@ 0x4f668e4 __cxa_throw
@ 0x15ddf50 _Znwm.cold
@ 0x174df10 std::vector<>::_M_default_append()
@ 0x174a0d8 starrocks::vectorized::BinaryColumn::append_selective()
@ 0x20d30d8 starrocks::vectorized::NullableColumn::append_selective()
@ 0x1b440ff starrocks::vectorized::MemTable::insert()
@ 0x2e544fd starrocks::vectorized::DeltaWriter::write()
@ 0x2e43c6b starrocks::vectorized::AsyncDeltaWriter::_execute()
@ 0x362c72c bthread::ExecutionQueueBase::_execute()
@ 0x362d4f8 bthread::ExecutionQueueBase::_execute_tasks()
@ 0x1d786c9 starrocks::ThreadPool::dispatch_thread()
@ 0x1d74288 starrocks::thread::supervise_thread()
@ 0x7fe804f7cea5 start_thread
@ 0x7fe8043819fd __clone
@ 0x0 (unknown)
start time: Wed Jun 1 23:02:20 CST 2022
*** Aborted at 1654146832 (unix time) try "date -d @1654146832" if you are using GNU date ***
PC: @ 0x23974ca starrocks::vectorized::JsonDocumentStreamParser::get_current()
*** SIGSEGV (@0x3b02fdb2) received by PID 28451 (TID 0x7f3c6d8b8700) from PID 990051762; stack trace: ***
@ 0x3503022 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f3d19329630 (unknown)
@ 0x23974ca starrocks::vectorized::JsonDocumentStreamParser::get_current()
@ 0x239d82a starrocks::vectorized::JsonReader::_read_chunk_from_document_stream()
@ 0x239ec1d starrocks::vectorized::JsonScanner::get_next()
@ 0x238c7a7 starrocks::vectorized::FileScanNode::_scanner_scan()
@ 0x238d94f starrocks::vectorized::FileScanNode::_scanner_worker()
@ 0x4fe0870 execute_native_thread_routine
@ 0x7f3d19321ea5 start_thread
@ 0x7f3d187269fd __clone
@ 0x0 (unknown)
start time: Thu Jun 2 13:16:26 CST 2022
*** Aborted at 1654147017 (unix time) try "date -d @1654147017" if you are using GNU date ***
PC: @ 0x50164aa __libc_malloc
*** SIGSEGV (@0x0) received by PID 19889 (TID 0x7fa050360700) from PID 0; stack trace: ***
@ 0x3503022 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fa0a1b88630 (unknown)
@ 0x50164aa __libc_malloc
@ 0x1ceab18 malloc
@ 0x4f66c75 operator new()
@ 0x20d41cf starrocks::vectorized::NullableColumn::reserve()
@ 0x20caf34 starrocks::vectorized::Chunk::clone_empty_with_slot()
@ 0x20cb423 starrocks::vectorized::Chunk::clone_empty_with_slot()
@ 0x2103834 starrocks::stream_load::NodeChannel::add_chunk()
@ 0x21040b0 starrocks::stream_load::OlapTableSink::_send_chunk_by_node()
@ 0x2109d6d starrocks::stream_load::OlapTableSink::send_chunk()
@ 0x1cb0e98 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
@ 0x1cb18e7 starrocks::PlanFragmentExecutor::open()
@ 0x1c4c4e2 starrocks::FragmentExecState::execute()
@ 0x1c50d8c starrocks::FragmentMgr::exec_actual()
@ 0x1c515b1 _ZNSt17_Function_handlerIFvvEZN9starrocks11FragmentMgr18exec_plan_fragmentERKNS1_23TExecPlanFragmentParamsERKSt8functionIFvPNS1_20PlanFragmentExecutorEEESC_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x1d786c9 starrocks::ThreadPool::dispatch_thread()
@ 0x1d74288 starrocks::thread::supervise_thread()
@ 0x7fa0a1b80ea5 start_thread
@ 0x7fa0a0f859fd __clone
@ 0x0 (unknown)
start time: Thu Jun 2 13:27:47 CST 2022
*** Aborted at 1654147708 (unix time) try "date -d @1654147708" if you are using GNU date ***
PC: @ 0x4ea75d3 tcmalloc::ThreadCache::ReleaseToCentralCache()
*** SIGSEGV (@0x0) received by PID 21361 (TID 0x7f7466b60700) from PID 0; stack trace: ***
@ 0x3503022 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f74b84e5630 (unknown)
@ 0x4ea75d3 tcmalloc::ThreadCache::ReleaseToCentralCache()
@ 0x4ea7955 tcmalloc::ThreadCache::ListTooLong()
@ 0x173da35 std::vector<>::_M_default_append()
@ 0x174a0a0 starrocks::vectorized::BinaryColumn::append_selective()
@ 0x20c9bca starrocks::vectorized::Chunk::append_selective()
@ 0x2103a6f starrocks::stream_load::NodeChannel::add_chunk()
@ 0x21040b0 starrocks::stream_load::OlapTableSink::_send_chunk_by_node()
@ 0x2109d6d starrocks::stream_load::OlapTableSink::send_chunk()
@ 0x1cb0e98 starrocks::PlanFragmentExecutor::_open_internal_vectorized()
@ 0x1cb18e7 starrocks::PlanFragmentExecutor::open()
@ 0x1c4c4e2 starrocks::FragmentExecState::execute()
@ 0x1c50d8c starrocks::FragmentMgr::exec_actual()
@ 0x1c515b1 _ZNSt17_Function_handlerIFvvEZN9starrocks11FragmentMgr18exec_plan_fragmentERKNS1_23TExecPlanFragmentParamsERKSt8functionIFvPNS1_20PlanFragmentExecutorEEESC_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x1d786c9 starrocks::ThreadPool::dispatch_thread()
@ 0x1d74288 starrocks::thread::supervise_thread()
@ 0x7f74b84ddea5 start_thread
@ 0x7f74b78e29fd __clone
@ 0x0 (unknown)
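
To line the crash records up against the "start time" lines, the unix timestamps in the "Aborted at" lines can be converted to local time. The following is a rough sketch rather than an official tool: it assumes the log is in CST (UTC+8), as the "start time" lines indicate, and the helper names are made up for illustration. It also converts the tcmalloc "large alloc" size to TiB.

```python
import re
from datetime import datetime, timedelta, timezone

# The "start time" lines in the log are in CST (UTC+8).
CST = timezone(timedelta(hours=8))

ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
SIGNAL_RE = re.compile(r"\*\*\* (SIG[A-Z]+) \(@0x[0-9a-f]+\) received by PID (\d+)")
ALLOC_RE = re.compile(r"tcmalloc: large alloc (\d+) bytes")

# Excerpt of the log lines quoted in this thread.
SAMPLE = """\
tcmalloc: large alloc 11725260726272 bytes == (nil) @ 0x4e98709 0x50163dc
*** Aborted at 1654094030 (unix time) try "date -d @1654094030" if you are using GNU date ***
*** SIGABRT (@0x6777) received by PID 26487 (TID 0x7fe7f57e4700) from PID 26487; stack trace: ***
*** Aborted at 1654146832 (unix time) try "date -d @1654146832" if you are using GNU date ***
*** SIGSEGV (@0x3b02fdb2) received by PID 28451 (TID 0x7f3c6d8b8700) from PID 990051762; stack trace: ***
"""


def crash_times(log_text):
    """Return (signal, pid, crash time in CST) for each crash record."""
    stamps = ABORT_RE.findall(log_text)
    signals = SIGNAL_RE.findall(log_text)
    result = []
    for stamp, (signal, pid) in zip(stamps, signals):
        when = datetime.fromtimestamp(int(stamp), tz=CST)
        result.append((signal, int(pid), when.strftime("%Y-%m-%d %H:%M:%S")))
    return result


def large_allocs_tib(log_text):
    """Return the size of each tcmalloc 'large alloc' request in TiB."""
    return [int(size) / 2**40 for size in ALLOC_RE.findall(log_text)]


for record in crash_times(SAMPLE):
    print(record)
# ('SIGABRT', 26487, '2022-06-01 22:33:50')
# ('SIGSEGV', 28451, '2022-06-02 13:13:52')

print([f"{tib:.2f} TiB" for tib in large_allocs_tib(SAMPLE)])
# ['10.66 TiB']
```

The first crash comes 27 seconds after its "start time" line (22:33:23). An allocation request of about 10.66 TiB is more likely a bad length value than a genuine memory need, although the log alone does not prove that.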

Is this cluster running version 2.1.3?

version info
Version: 2.1.3
Git: 0881cb2
Build Info: None@abb23110e87c
Build Time: 2022-03-18 12:29:07

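There are 4 BE nodes: two are healthy and two are failing. When the third BE node is started, the cluster stays healthy, but after the fourth BE node is started, two nodes crash after a while. Which nodes crash is random.

Argh. Running three nodes is fine, but starting the fourth node always brings down two nodes.

The system has a large number of unhealthy tablets.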

show proc '/backends'\G;
All nodes are reported as alive:
MySQL [(none)]> show proc '/backends'\G;
*************************** 1. row ***************************
BackendId: 10003
Cluster: default_cluster
IP: 10.240.102.8
HostName: xxx
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-03-22 16:11:39
LastHeartbeat: NULL
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 1293
DataUsedCapacity: .000
AvailCapacity: 466.934 GB
TotalCapacity: 492.024 GB
UsedPct: 5.10 %
MaxDiskUsedPct: 5.10 %
ErrMsg:
Version:
Status: {"lastSuccessReportTabletsTime":"N/A"}
DataTotalCapacity: 466.934 GB
DataUsedPct: 0.00 %
*************************** 2. row ***************************
BackendId: 10007
Cluster: default_cluster
IP: 10.240.102.7
HostName: xxxx
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-03-22 16:11:49
LastHeartbeat: NULL
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 21507
DataUsedCapacity: .000
AvailCapacity: 466.934 GB
TotalCapacity: 492.024 GB
UsedPct: 5.10 %
MaxDiskUsedPct: 5.10 %
ErrMsg:
Version:
Status: {"lastSuccessReportTabletsTime":"N/A"}
DataTotalCapacity: 466.934 GB
DataUsedPct: 0.00 %
*************************** 3. row ***************************
BackendId: 10008
Cluster: default_cluster
IP: 10.240.102.5
HostName: xxxx
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-03-22 16:11:59
LastHeartbeat: NULL
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 21750
DataUsedCapacity: .000
AvailCapacity: 466.934 GB
TotalCapacity: 492.024 GB
UsedPct: 5.10 %
MaxDiskUsedPct: 5.10 %
ErrMsg:
Version:
Status: {"lastSuccessReportTabletsTime":"N/A"}
DataTotalCapacity: 466.934 GB
DataUsedPct: 0.00 %
*************************** 4. row ***************************
BackendId: 10049
Cluster: default_cluster
IP: 10.240.102.9
HostName: xxxx
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-03-22 16:13:29
LastHeartbeat: NULL
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 20954
DataUsedCapacity: .000
AvailCapacity: 466.934 GB
TotalCapacity: 492.024 GB
UsedPct: 5.10 %
MaxDiskUsedPct: 5.10 %
ErrMsg:
Version:
Status: {"lastSuccessReportTabletsTime":"N/A"}
DataTotalCapacity: 466.934 GB
DataUsedPct: 0.00 %
4 rows in set (0.00 sec)

ERROR: No query specified

ERROR: No query specified
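
The vertical (\G) output above is easier to compare once it is parsed into records. Below is a minimal sketch that does so and flags backends whose TabletNum is far below the others. The parser, the helper names, and the 0.5 cutoff ratio are illustrative assumptions, not part of StarRocks; the sample keeps only a few fields from each row.

```python
import re
from statistics import median

ROW_RE = re.compile(r"^\*+ \d+\. row \*+$")
END_RE = re.compile(r"^\d+ rows? in set")

# Abbreviated copy of the output quoted in this thread.
SAMPLE = """\
*************************** 1. row ***************************
BackendId: 10003
IP: 10.240.102.8
LastStartTime: 2022-03-22 16:11:39
Alive: true
TabletNum: 1293
*************************** 2. row ***************************
BackendId: 10007
IP: 10.240.102.7
LastStartTime: 2022-03-22 16:11:49
Alive: true
TabletNum: 21507
*************************** 3. row ***************************
BackendId: 10008
IP: 10.240.102.5
LastStartTime: 2022-03-22 16:11:59
Alive: true
TabletNum: 21750
*************************** 4. row ***************************
BackendId: 10049
IP: 10.240.102.9
LastStartTime: 2022-03-22 16:13:29
Alive: true
TabletNum: 20954
4 rows in set (0.00 sec)
"""


def parse_vertical(text):
    """Parse mysql-client vertical output into a list of dicts, one per row."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if END_RE.match(line):
            break  # ignore the row-count footer and anything after it
        if ROW_RE.match(line):
            rows.append({})
        elif rows and ":" in line:
            # Split on the first colon only, so timestamps keep their colons.
            key, _, value = line.partition(":")
            rows[-1][key.strip()] = value.strip()
    return rows


def low_tablet_backends(rows, ratio=0.5):
    """Return backend IDs whose tablet count is below ratio * median."""
    counts = {int(r["BackendId"]): int(r["TabletNum"]) for r in rows}
    cutoff = ratio * median(counts.values())
    return sorted(b for b, n in counts.items() if n < cutoff)


backends = parse_vertical(SAMPLE)
print(len(backends))                  # 4
print(low_tablet_backends(backends))  # [10003]
```

On the output quoted here this flags only backend 10003, which holds 1293 tablets against roughly 21,000 on each of the other three. Every row also reports Alive: true with LastHeartbeat: NULL.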

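Yet the StarRocks management platform shows nodes as down. Bizarre.

Hello, we have just confirmed that this issue has already been fixed in the latest 2.1 release. We recommend upgrading.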