StarRocks BE fails to start, and a query crashes the BE

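【Details】StarRocks BE fails to start, and a query crashes the BE
【Background】One BE failed to start
【Business impact】
【StarRocks version】2.5.3
【Cluster size】e.g. 2 FE + 3 BE
【Machine info】x86_64, 16 CPUs, 32 GB memory
【Contact】StarRocks community group 3, @ θ
【Attachments】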
ldd --version
ldd (GNU libc) 2.28

OS:
BigCloud Enterprise Linux For Euler release 21.10 (LTS-SP2)

be.INFO
I0620 08:42:47.104465 1205922 txn_manager.cpp:285] Commit txn successfully. tablet: 10208, txn_id: 2770, rowsetid: 02000000000005133943c0eb2dd73034f1b1f54563526bb8 #segment:0 #delfile:0
I0620 08:42:47.104466 1205922 data_dir.cpp:337] Added committed rowset=02000000000005133943c0eb2dd73034f1b1f54563526bb8 tablet=10208 schema hash=1612420876 txn_id: 2770
I0620 08:42:47.110098 1205988 fragment_mgr.cpp:516] FragmentMgr cancel worker start working.
I0620 08:42:47.116009 1205837 exec_env.cpp:173] [PIPELINE] Exec thread pool: thread_num=16
I0620 08:42:47.179982 1206390 runtime_filter_worker.cpp:760] RuntimeFilterWorker start working.
I0620 08:42:47.180194 1206392 profile_report_worker.cpp:99] ProfileReportWorker start working.
I0620 08:42:47.180380 1206393 result_buffer_mgr.cpp:132] result buffer manager cancel thread begin.
I0620 08:42:47.182157 1205837 load_path_mgr.cpp:55] Load path configured to [/opt/cluster/data/mpp/storage/mini_download]
I0620 08:42:47.191814 1206495 compaction_manager.cpp:57] start compaction scheduler
I0620 08:42:47.192272 1206497 storage_engine.cpp:609] start to check compaction
I0620 08:42:47.193277 1206503 olap_server.cpp:667] begin to do tablet meta checkpoint:/opt/cluster/data/mpp/storage
I0620 08:42:47.193517 1206506 olap_server.cpp:617] try to perform path gc by tablet!
I0620 08:42:47.193552 1205837 olap_server.cpp:208] All backgroud threads of storage engine have started.
I0620 08:42:47.194677 1205837 thrift_server.cpp:375] heartbeat has started listening port on 9050
I0620 08:42:47.194700 1205837 backend_base.cpp:66] StarRocksInternalService has started listening port on 9060
I0620 08:42:47.194977 1205837 thrift_server.cpp:375] BackendService has started listening port on 9060
I0620 08:42:47.201607 1205837 server.cpp:1070] Server[starrocks::BackendInternalServiceImpl<starrocks::PInternalService>+starrocks::LakeServiceImpl+starrocks::BackendInternalServiceImpl<doris::PBackendService>] is serving on port=8060.
I0620 08:42:47.201661 1205837 server.cpp:1073] Check out h
The errors differ somewhat from day to day.

be.out
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1687221767 (unix time) try "date -d @1687221767" if you are using GNU date ***
PC: @ 0x7fa9f0e5c60b gsignal
*** SIGABRT (@0x3e80012664d) received by PID 1205837 (TID 0x7fa9f0e05fc0) from PID 1205837; stack trace: ***
@ 0x5769222 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fa9f11834c0 (unknown)
@ 0x7fa9f0e5c60b gsignal
@ 0x7fa9f0e5d931 abort
@ 0x2a31bdc _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
@ 0x7c308b6 __cxxabiv1::__terminate()
@ 0x7c30921 std::terminate()
@ 0x7c30a74 __cxa_throw
@ 0x2a33ce0 std::__throw_system_error()
@ 0x7cabc59 std::thread::_M_start_thread()
@ 0x4cb50d0 starrocks::EvHttpServer::start()
@ 0x4740d7a starrocks::HttpServiceBE::start()
@ 0x473e91c start_be()
@ 0x2a37450 main
@ 0x7fa9f0e48b27 __libc_start_main
@ 0x2b5172f (unknown)
@ 0x0 (unknown)

be.WARNING
W0620 08:42:47.262043 1205837 stack_util.cpp:128] 2023-06-20 08:42:47.262009, query_id=00000000-0000-0000-0000-000000000000, fragment_instance_id=0000000

dmesg -T
[Tue Jun 20 08:44:05 2023] audit: type=1110 audit(1687221841.941:1287572): pid=1206665 uid=0 auid=1002 ses=176348 msg='op=PAM:setcred grantors=pam_env,pam_faillock,pam_unix acct="aspmon" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'
[Tue Jun 20 08:44:05 2023] audit: type=1105 audit(1687221841.943:1287573): pid=1206666 uid=0 auid=1002 ses=176349 msg='op=PAM:session_open grantors=pam_loginuid,pam_keyinit,pam_limits,pam_systemd acct="aspmon" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'
[Tue Jun 20 08:44:05 2023] audit: type=1110 audit(1687221841.944:1287574): pid=1206666 uid=0 auid=1002 ses=176349 msg='op=PAM:setcred grantors=pam_env,pam_faillock,pam_unix acct="aspmon" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'

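Problem 2:
3.8 million rows were loaded from an Elasticsearch external table into an internal table with an INSERT INTO ... SELECT statement. The table DDL has 6 columns and is partitioned and bucketed, with the partitioning on a time column.
select colum1, count(1) from table_1 group by colum1; -- works
select count(1) from table_1; -- works
select * from table_1 limit 10000, 10; -- works
select * from table_1; -- crashes the BE immediately
select * from table_1 order by time_type_column limit 10000, 10; -- crashes the BE immediately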

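The thread count may have hit an operating-system limit, or there may be a problem with the system thread library that the BE links against:

You can check the linked libraries with ldd lib/starrocks_be.

Then look at /proc/<BE pid>/limits.

Also check ulimit -a.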
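
The stack trace in be.out ends with std::thread::_M_start_thread() throwing std::system_error, which is what libstdc++ does when pthread_create fails, so thread-related limits are a reasonable first check. Below is a minimal sketch, assuming a Linux /proc filesystem. The name thread_limits is made up for illustration, and the thread does not establish which limit, if any, was actually hit.

```shell
#!/usr/bin/env bash
# thread_limits is a hypothetical helper name, not part of StarRocks.
# It prints system-wide and per-user limits that can make thread creation
# fail, plus the current thread count of one process.
thread_limits() {
    local pid="$1" f
    for f in /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max /proc/sys/vm/max_map_count; do
        printf '%s = %s\n' "$f" "$(cat "$f" 2>/dev/null || echo n/a)"
    done
    # Per-user cap on processes and threads for the current shell
    printf 'ulimit -u = %s\n' "$(ulimit -u)"
    # "Threads:" in /proc/<pid>/status is the live thread count of that process
    if [ -r "/proc/$pid/status" ]; then
        grep '^Threads:' "/proc/$pid/status"
    fi
}

# Default to the current shell when no PID is given
thread_limits "${1:-$$}"
```

Run it as the same user that starts the BE, since ulimit -u is per user and can differ between an interactive login and the account that launches the service.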

ldd lib/starrocks_be
linux-vdso.so.1 (0x00007ffdaf030000)
libunwind.so.8 => not found
liblzma.so.5 => /usr/lib64/liblzma.so.5 (0x00007f622558c000)
libjvm.so => not found
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f622556b000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f6225566000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f62253e3000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f62253d6000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f622521e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f62255c8000)
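
Two entries in the ldd output above are unresolved. A small filter makes them easy to spot. This is a sketch only: missing_libs is a made-up helper name, and whether the missing entries are related to the crash is not settled in this thread. They may simply mean the interactive shell lacks the library path that the BE start script sets up, which is an assumption.

```shell
#!/usr/bin/env bash
# missing_libs is a hypothetical helper: it reads `ldd` output on stdin
# and prints the name of every library the loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Sample input copied from the ldd output in this thread
missing_libs <<'EOF'
linux-vdso.so.1 (0x00007ffdaf030000)
libunwind.so.8 => not found
libjvm.so => not found
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f622556b000)
EOF
# prints:
# libunwind.so.8
# libjvm.so
```

On the real binary the equivalent call would be `ldd lib/starrocks_be | missing_libs`.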

/proc/<BE pid>/limits
The process dies too quickly to capture this.
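
When the process exits before the limits file can be read by hand, one option is to poll for it and copy the file as soon as the process appears. Below is a minimal sketch, assuming pgrep is installed and that the process name is starrocks_be (inferred from the binary path lib/starrocks_be above). The helper name capture_limits and the output path are made up for illustration.

```shell
#!/usr/bin/env bash
# capture_limits is a hypothetical helper, not a StarRocks tool.
# Usage: capture_limits <process-name> <output-file> [tries]
# Polls every 0.1 s until a process with that exact name exists, then
# saves its /proc/<pid>/limits and returns 0. Returns 1 if it gives up.
capture_limits() {
    local name="$1" out="$2" tries="${3:-100}" pid
    while [ "$tries" -gt 0 ]; do
        # -n: newest matching process, -x: match the process name exactly
        if pid=$(pgrep -n -x "$name") && cat "/proc/$pid/limits" > "$out" 2>/dev/null; then
            return 0
        fi
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}

# Example: start this in the background, then launch the BE as usual.
# capture_limits starrocks_be /tmp/be_limits.txt 300 &
```

The saved file shows the limits the BE process actually inherited, which can differ from what ulimit -a reports in an interactive shell.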

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126468
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 131072
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

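@流木 After I forcibly dropped the BE from the cluster, deleted its storage and meta directories, and restarted it, things were OK; at least the BE starts. One business OLAP table was unavailable. After dropping that table, the BE rejoined the cluster without problems.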

mysql> ALTER SYSTEM DROP BACKEND "10.1.5.111:9050";
ERROR 1064 (HY000): Unexpected exception: Tables such as [statistics.table_statistic_v1] on the backend[10.1.5.111:9050] have only one replica. To avoid data loss, please change the replication_num of [statistics.table_statistic_v1] to three. ALTER SYSTEM DROP BACKEND FORCE can be used to forcibly drop the backend.
mysql> ALTER SYSTEM DROP BACKEND "10.1.5.111:9050" FORCE;

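@流木 The contents of storage_path on that machine were probably corrupted. I re-ran the DDL and reloaded the data from the external table, and all the queries now work with the SQL posted in the group yesterday. The only change is that I narrowed the time range of the partitions:
START("2019-01-01") END("2032-01-01") EVERY(INTERVAL 1 month) --> START("2019-01-01") END("2024-01-01") EVERY(INTERVAL 1 month)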
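
The size of that change can be worked out directly, because EVERY(INTERVAL 1 month) creates one partition per month between START and END. Below is a quick arithmetic sketch. months_between is a made-up helper name, and the thread does not state the bucket count, so the resulting tablet count cannot be derived here.

```shell
#!/usr/bin/env bash
# months_between is a hypothetical helper: whole months from
# <start-year>-01-01 up to (not including) <end-year>-01-01.
months_between() {
    echo $(( ($2 - $1) * 12 ))
}

echo "before: $(months_between 2019 2032) monthly partitions"   # before: 156 monthly partitions
echo "after:  $(months_between 2019 2024) monthly partitions"   # after:  60 monthly partitions
```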