All BE nodes crash at 5 a.m.

Version: 2.3.1
1 FE (16 cores, 32 GB, 60+1500), 3 BEs (16 cores, 32 GB, 60+1500)
Twice now, all three BE nodes have gone down at around 5 a.m.
The BE log is as follows:
W0920 05:01:09.858412 8718 input_messenger.cpp:214] Fail to read from Socket{id=768 fd=573 addr=11.125.9.2:43502:8060} (0x651ec000): Connection reset by peer [104]
W0920 05:01:40.658473 8718 input_messenger.cpp:243] Close Socket{id=896 fd=578 addr=11.125.9.2:35574:8060} (0xa1600000) due to unknown message: \03\00\00*%\E0\00\00\00\00\00Cookie: mstshash=nmap\r\n\01\00\b
\00\03\00\00\00
W0920 05:01:45.662808 8722 input_messenger.cpp:214] Fail to read from Socket{id=1024 fd=578 addr=11.125.9.2:35646:8060} (0x2888e000): Connection reset by peer [104]
W0920 05:01:50.668354 8724 input_messenger.cpp:214] Fail to read from Socket{id=897 fd=578 addr=11.125.9.2:35784:8060} (0xa1600200): Connection reset by peer [104]
W0920 05:01:55.669509 8724 input_messenger.cpp:214] Fail to read from Socket{id=769 fd=578 addr=11.125.9.2:35874:8060} (0x651ec200): Connection reset by peer [104]
W0920 05:01:55.843670 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//nice%20ports%2C/Tri%6Eity.txt%2ebak
W0920 05:02:00.674520 8730 input_messenger.cpp:214] Fail to read from Socket{id=770 fd=584 addr=11.125.9.2:35928:8060} (0x651ec400): Connection reset by peer [104]
W0920 05:02:00.853807 9069 http_request.cpp:70] parse query str failed, query=CAVIT
W0920 05:02:05.675881 8724 input_messenger.cpp:214] Fail to read from Socket{id=8589935488 fd=578 addr=11.125.9.2:35988:8060} (0xa1600000): Connection reset by peer [104]
W0920 05:02:05.678467 8730 input_messenger.cpp:243] Close Socket{id=8589935362 fd=577 addr=11.125.9.2:36032:8060} (0x651ec400) due to unknown message: \80\00\00(r\FE\1D\13\00\00\00\00\00\00\00\02\00\01\86\A
0\00\01\97|\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00
W0920 05:02:05.679304 8722 input_messenger.cpp:243] Close Socket{id=17179869954 fd=577 addr=11.125.9.2:36034:8060} (0x651ec400) due to unknown message: \00\1E\00\06\01\00\00\01\00\00\00\00\00\00\07version\0
4bind\00\00\10\00\03
W0920 05:02:13.180601 8722 input_messenger.cpp:214] Fail to read from Socket{id=8589935363 fd=577 addr=11.125.9.2:36036:8060} (0x651ec600): Connection reset by peer [104]
W0920 05:02:13.181094 8718 input_messenger.cpp:243] Close Socket{id=772 fd=577 addr=11.125.9.2:36090:8060} (0x651ec800) due to unknown message: \16\03\00\00S\01\00\00O\03\00?G\D7\F7\BA,\EE\EA\B2`~\F3\00\FD
82{\B9\D5\96\C8w\9B\E6\C4\DB<=\DBo\EF\10n\00\00(\00\16\00\13\00\n\00f\00\05\00\04\00e\00d\00c…<skipping 24 bytes>
W0920 05:02:13.181953 8730 input_messenger.cpp:243] Close Socket{id=8589935364 fd=577 addr=11.125.9.2:36092:8060} (0x651ec800) due to unknown message: \00\00\00\A4\FFSMBr\00\00\00\00\b\01@\00\00\00\00\00\00
\00\00\00\00\00\00\00\00@\06\00\00\01\00\00\81\00\02PC NETWORK PROGRAM 1.0\00\02…<skipping 104 bytes>
W0920 05:02:18.185601 8724 input_messenger.cpp:214] Fail to read from Socket{id=17179869955 fd=577 addr=11.125.9.2:36094:8060} (0x651ec600): Connection reset by peer [104]
W0920 05:02:23.188439 8721 input_messenger.cpp:214] Fail to read from Socket{id=25769804547 fd=577 addr=11.125.9.2:36212:8060} (0x651ec600): Connection reset by peer [104]
W0920 05:02:23.188880 8715 input_messenger.cpp:243] Close Socket{id=34359739139 fd=577 addr=11.125.9.2:36242:8060} (0x651ec600) due to unknown message: OPTIONS sip:nm SIP/2.0\r\nVia: SIP/2.0/TCP nm;branch=f
oo\r\nFrom: <s…<skipping 159 bytes>
W0920 05:02:23.189824 8723 input_messenger.cpp:243] Close Socket{id=1152 fd=577 addr=11.125.9.2:36244:8060} (0x886fe000) due to unknown message: \00Z\00\00\01\00\00\00\016\01,\00\00\b\00\7F\FF\7F\b\00\00\00
\01\00 \00:\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\004\E6\00\00\00\01\00\00\00\00\00\00\00\00(CONNE…<skipping 26 bytes>
W0920 05:02:23.190706 8717 input_messenger.cpp:243] Close Socket{id=1153 fd=577 addr=11.125.9.2:36246:8060} (0x886fe200) due to unknown message: \05\04\00\01\02\80\05\01\00\03\ngoogle.com\00PGET / HTTP/1.0
r\n\r\n
W0920 05:02:28.191862 8715 input_messenger.cpp:214] Fail to read from Socket{id=513 fd=577 addr=11.125.9.2:36248:8060} (0x2be62200): Connection reset by peer [104]
W0920 05:03:04.221485 8725 input_messenger.cpp:243] Close Socket{id=8589935746 fd=600 addr=11.125.9.2:36960:8060} (0x886fe400) due to unknown message: Secure * Secure-HTTP/1.4\r\nHost: 11.66.5.195:8060\r\nC
onnection: cl…<skipping 7 bytes>

W0920 05:03:04.221810 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//plugins/servlet/oauth/users/icon-uri
W0920 05:03:04.249600 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//main.php
W0920 05:03:04.251614 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//phpMyAdmin/main.php
W0920 05:03:04.253024 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//phpmyadmin/main.php
W0920 05:03:04.254456 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/main.php
W0920 05:03:04.255770 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/main.php
W0920 05:03:04.269769 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//websvn
W0920 05:03:04.382985 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//artifactory/webapp/
W0920 05:03:07.195325 9068 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//phpinfo.php
W0920 05:03:09.231628 9068 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//phptax/drawimage.php
W0920 05:03:09.255450 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//jmx
W0920 05:03:09.270431 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/ash
W0920 05:03:09.273001 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/bash
W0920 05:03:09.274369 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/csh
W0920 05:03:09.275703 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/ksh
W0920 05:03:09.277034 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/sh
W0920 05:03:09.278388 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/tcsh
W0920 05:03:09.279755 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/zsh
W0920 05:03:09.281106 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/ash
W0920 05:03:09.282382 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/bash
W0920 05:03:09.283702 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/csh
W0920 05:03:09.285045 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/ksh
W0920 05:03:09.286330 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/sh
W0920 05:03:09.287725 9068 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/tcsh
W0920 05:03:09.289157 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/zsh
W0920 05:03:09.426956 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//rest/api/latest/groupuserpicker
W0920 05:03:09.620805 9068 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//net/net/net.html
W0920 05:03:09.626935 9069 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//cgi-bin/apexec.pl
W0920 05:03:09.633863 9067 utils.cpp:124] Failed to open file: /data/starrocks/StarRocks-2.3.1/be/www//scripts/apexec.pl
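
Every socket warning in the BE log above names the same remote peer. A minimal Python sketch for tallying remote addresses in such a log follows; the regular expression assumes the `addr=IP:port:8060` layout seen in these lines, and `count_peers` is a hypothetical helper name, not part of StarRocks.

```python
import re
from collections import Counter

# Matches the remote IP inside socket descriptions such as
# "Socket{id=768 fd=573 addr=11.125.9.2:43502:8060}".
# The layout is assumed from the log lines in this thread.
ADDR_RE = re.compile(r"addr=(\d+\.\d+\.\d+\.\d+):\d+")

def count_peers(lines):
    """Return a Counter mapping remote IP -> number of log lines that mention it."""
    peers = Counter()
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            peers[match.group(1)] += 1
    return peers
```

Passing an open log file (for example `count_peers(open(path))`, where `path` points at the BE warning log) would show whether the traffic comes from a single scanner host.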

The FE log is as follows:
2022-09-23 05:01:39,603 ERROR (thrift-server-pool-63|29263) [SRTThreadPoolServer$WorkerProcess.run():318] Thrift Error occurred during processing of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.SRTThreadPoolServer$WorkerProcess.run(SRTThreadPoolServer.java:310) [starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_202]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_202]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_202]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_202]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_202]
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
… 9 more

Server log:
Sep 23 05:01:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374429 of user root.
Sep 23 05:01:41 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29357]: Did not receive identification string from 11.125.9.2 port 55978
Sep 23 05:02:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374432 of user root.
Sep 23 05:02:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374431 of user root.
Sep 23 05:03:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374433 of user root.
Sep 23 05:03:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374434 of user root.
Sep 23 05:03:14 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29624]: Connection closed by 11.125.9.2 port 34738 [preauth]
Sep 23 05:03:15 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29627]: dispatch_protocol_error: type 52 seq 4 [preauth]
Sep 23 05:03:15 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29627]: dispatch_protocol_error: type 90 seq 6 [preauth]
Sep 23 05:03:15 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29629]: Connection closed by 11.125.9.2 port 35234 [preauth]
Sep 23 05:03:17 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29627]: Connection closed by 11.125.9.2 port 35192 [preauth]
Sep 23 05:03:22 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29640]: Did not receive identification string from 11.125.9.2 port 36216
Sep 23 05:04:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374435 of user root.
Sep 23 05:04:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374436 of user root.
Sep 23 05:04:09 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29817]: Did not receive identification string from 11.125.9.2 port 50986
Sep 23 05:04:14 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29827]: Invalid user #does_not_exists# from 11.125.9.2 port 52702
Sep 23 05:04:14 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29827]: input_userauth_request: invalid user #does_not_exists# [preauth]
Sep 23 05:04:14 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29827]: fatal: ssh_packet_get_string: incomplete message [preauth]
Sep 23 05:04:15 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29829]: Did not receive identification string from 11.125.9.2 port 52998
Sep 23 05:04:15 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29831]: Bad protocol version identification 'expr 90791 \* 36684' from 11.125.9.2 port 31334
Sep 23 05:04:23 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29846]: Protocol major versions differ for 11.125.9.2 port 55102: SSH-2.0-OpenSSH_7.4 vs. SSH-1.33-TEST
Sep 23 05:04:23 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[29847]: Protocol major versions differ for 11.125.9.2 port 55112: SSH-2.0-OpenSSH_7.4 vs. SSH-1.5-TEST
Sep 23 05:05:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374437 of user root.
Sep 23 05:05:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374438 of user root.
Sep 23 05:05:29 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 sshd[30148]: Did not receive identification string from 11.125.9.2 port 36216
Sep 23 05:06:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374440 of user root.
Sep 23 05:06:01 Server-eb855dbc-44e0-4a42-8de6-c77a219d8351 systemd: Started Session 374439 of user root.

The screenshot contains your personal information, so I suggest redacting it. You can post the log as plain text instead.

Could you post the be.out file from a BE that crashed?


Is this a package you compiled yourself, or the one from the official website?

start time: Tue Sep 20 08:35:09 CST 2022
tcmalloc: large alloc 1195728896 bytes == 0x15c110000 @ 0x57cdbbf 0x5a5f21c 0x207da18 0x59af6b5
tcmalloc: large alloc 1246912512 bytes == 0x1b8804000 @ 0x57cdbbf 0x5a5f21c 0x207da18 0x59af6b5
tcmalloc: large alloc 1811947520 bytes == 0x1a3566000 @ 0x57cdbbf 0x5a5f21c 0x207da18 0x59af6b5
tcmalloc: large alloc 1811947520 bytes == 0x1a3566000 @ 0x57cdbbf 0x5a5f21c 0x207da18 0x59af6b5
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
terminate called recursively
terminate called recursively
*** Aborted at 1663880580 (unix time) try "date -d @1663880580" if you are using GNU date ***
PC: @ 0x7f89d4848387 __GI_raise
*** SIGABRT (@0x3e900000c0c) received by PID 3084 (TID 0x7f88ca89a700) from PID 3084; stack trace: ***
@ 0x3fab972 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f89d52fd630 (unknown)
@ 0x7f89d4848387 __GI_raise
@ 0x7f89d4849a78 __GI_abort
@ 0x188983d _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
@ 0x59af166 __cxxabiv1::__terminate()
@ 0x59af1d1 std::terminate()
@ 0x59af324 __cxa_throw
@ 0x1889744 _Znwm.cold
@ 0x5a2696a std::__cxx11::basic_string<>::_M_mutate()
@ 0x5a27390 std::__cxx11::basic_string<>::_M_replace_aux()
@ 0x1fc646d apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
@ 0x1fc661c apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
@ 0x21ab5a9 apache::thrift::TDispatchProcessor::process()
@ 0x3f94168 apache::thrift::server::TConnectedClient::run()
@ 0x3f8c664 apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
@ 0x3f8ee6d apache::thrift::concurrency::thread::threadMain()
@ 0x3f74476 std::thread::_State_impl<>::_M_run()
@ 0x5a292d0 execute_native_thread_routine
@ 0x7f89d52f5ea5 start_thread
@ 0x7f89d4910b0d __clone
@ 0x0 (unknown)
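
The trace above shows `std::bad_alloc` thrown from `readStringBody` while `readMessageBegin` was parsing a message header. One possible reading, not confirmed in this thread, is that the first four bytes of a non-Thrift request were interpreted as a string length. The Python sketch below only illustrates that arithmetic. `MAX_NAME_BYTES`, `header_as_length`, and `read_name_length` are hypothetical names, and the cap is an illustrative guard, not the actual StarRocks or Thrift fix.

```python
import struct

MAX_NAME_BYTES = 1 << 20  # hypothetical cap chosen for illustration: 1 MiB

def header_as_length(first_four: bytes) -> int:
    """Interpret four bytes as a big-endian signed 32-bit integer."""
    return struct.unpack(">i", first_four)[0]

def read_name_length(first_four: bytes, limit: int = MAX_NAME_BYTES) -> int:
    """Return the declared length, rejecting values outside [0, limit].

    Negative values are simply treated as invalid here to keep the sketch short.
    """
    size = header_as_length(first_four)
    if size < 0 or size > limit:
        raise ValueError(f"implausible length prefix: {size}")
    return size

# An HTTP request begins with b"GET "; read as a length it is about 1.1 GiB.
print(header_as_length(b"GET "))  # 1195725856
```

That value is close to the first `tcmalloc: large alloc 1195728896 bytes` line in be.out, which would be consistent with the allocator rounding the request up. This is an observation, not a confirmed diagnosis.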

It is the 2.3.1 package downloaded from the official website.

Run dmesg -T | tail -10 and take a look. It was probably an OOM.
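
If the dmesg output is long, a small filter can help. The Python sketch below keeps only lines that look like OOM-killer activity; the phrases in the pattern are ones the Linux kernel commonly logs, but exact wording varies by kernel version, so treat the pattern as an assumption. `find_oom_lines` is a hypothetical helper name.

```python
import re

# Phrases commonly seen in kernel logs when the OOM killer runs.
# Wording differs between kernel versions, so this list is an assumption.
OOM_RE = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

def find_oom_lines(dmesg_text: str) -> list:
    """Return the lines of dmesg output that look like OOM-killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_RE.search(line)]
```

It could be fed with `subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout`, which may require root privileges on some systems.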

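What value have you set for somaxconn? Most of the log entries are about rejected connections, so I suspect the system's listen backlog may have filled up.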
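
On Linux, the current value can be read from `/proc/sys/net/core/somaxconn` (equivalently, `sysctl net.core.somaxconn`). A minimal Python sketch follows; `read_somaxconn` is a hypothetical helper name, and the `path` parameter exists only so the function can be pointed at another file.

```python
from pathlib import Path

def read_somaxconn(path: str = "/proc/sys/net/core/somaxconn") -> int:
    """Return the kernel's ceiling on the listen backlog (net.core.somaxconn)."""
    return int(Path(path).read_text().strip())
```

The returned number is the upper bound the kernel applies to the backlog argument of `listen()`. Whether it is actually the bottleneck here is not established in this thread.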

Could you also post your be.conf file so we can take a look?

I have barely changed anything in be.conf.

# INFO, WARNING, ERROR, FATAL
sys_log_level = INFO

# ports for admin, web, heartbeat service
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060

# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.
priority_networks = 192.168.47.243

# data root path, separate by ';'
# you can specify the storage medium of each root path, HDD or SSD, seperate by ','
# eg:
# storage_root_path = /data1,medium:HDD;/data2,medium:SSD;/data3
# /data1, HDD;
# /data2, SSD;
# /data3, HDD(default);
# Default value is ${STARROCKS_HOME}/storage, you should create it by hand.
storage_root_path = ${STARROCKS_HOME}/storage

# Advanced configurations
sys_log_dir = ${STARROCKS_HOME}/log
sys_log_roll_mode = SIZE-MB-1024
sys_log_roll_num = 10
sys_log_verbose_modules = *
log_buffer_level = -1

default_rowset_type = beta

Could we discuss this issue together?

The cause has been identified and a fix is in progress.

What was the cause? I have run into the same problem: sending unrecognized packets to BE port 8060 triggers a system OOM.