1. Environment:
[StarRocks version]: 2.2.5
[Cluster size]: 3 FE + 3 BE, co-located (the same 3 servers also run 3 ES nodes)
[Kafka version]: kafka_2.11-2.2.1
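2. Problem description
In this StarRocks cluster, after the 3 BEs start, 2 of them crash after running for a short while. After a restart, 2 BEs crash again a little later, and the nodes that crash may differ each time. dmesg -T shows no sign of OOM, and the ports and network look normal.
I tried deleting the FE metadata and BE data and reinstalling. After the reinstall, with no other operations performed, all 3 FEs and 3 BEs stayed healthy for nearly half an hour of observation. However, shortly after I created a new table and a routine load job (with data being written to the table during that time), the crash of 2 BEs reproduced.
3. be.out log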
starrocks_be: rdbuf.c:1099: rd_slice_narrow_copy: Assertion `rd_slice_abs_offset(new_slice) <= new_slice->end' failed.
*** Aborted at 1666253592 (unix time) try "date -d @1666253592" if you are using GNU date ***
PC: @ 0x7f328c0e8387 __GI_raise
*** SIGABRT (@0x1cc4) received by PID 7364 (TID 0x7f31d7787700) from PID 7364; stack trace: ***
@ 0x3cb75d2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f328cb9d630 (unknown)
@ 0x7f328c0e8387 __GI_raise
@ 0x7f328c0e9a78 __GI_abort
@ 0x7f328c0e11a6 __assert_fail_base
@ 0x7f328c0e1252 __GI___assert_fail
@ 0x4d17b4e rd_slice_narrow_copy
@ 0x4cf6e48 rd_kafka_msgset_reader_msg_v0_1
@ 0x4cf0602 rd_kafka_msgset_reader_run
@ 0x4cfd323 rd_kafka_msgset_parse
@ 0x4c429c5 rd_kafka_fetch_reply_handle
@ 0x4c46c35 rd_kafka_broker_fetch_reply
@ 0x4c6a8a6 rd_kafka_buf_callback
@ 0x4c3ccab rd_kafka_recv
@ 0x4c67d98 rd_kafka_transport_io_event
@ 0x4c68a3b rd_kafka_transport_io_serve
@ 0x4c4ab7c rd_kafka_broker_ops_io_serve
@ 0x4c4b4c0 rd_kafka_broker_consumer_serve
@ 0x4c4d120 rd_kafka_broker_serve
@ 0x4c4d73d rd_kafka_broker_thread_main
@ 0x4cd9e78 _thrd_wrapper_function
@ 0x7f328cb95ea5 start_thread
@ 0x7f328c1b0b0d __clone
@ 0x0 (unknown)
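Looking at be.out, the stack trace contains rd_kafka-related frames. A similar case has been reported on the forum: https://forum.mirrorship.cn/t/topic/1712
However, my error output does not contain rd_kafka_broker_destroy_final, so I am not sure whether the BE process crash is still caused by Kafka.
The logs from the three BE nodes are attached: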
BE.log.zip (187.8 KB)
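
For anyone comparing their own be.out against the trace above, here is a minimal sketch that lists which frames in a glog-style stack trace look like librdkafka symbols and checks whether rd_kafka_broker_destroy_final appears. It assumes frames are printed as `@ 0x<address> <symbol>` as in the log above. The helper names `parse_frames` and `summarize`, and the prefix list used to recognize librdkafka symbols, are illustrative assumptions and not part of StarRocks or librdkafka.

```python
import re

# Matches glog-style frames such as "@ 0x4d17b4e rd_slice_narrow_copy".
# Lines like "(@0x1cc4)" are skipped because there is no space after "@".
FRAME_RE = re.compile(r"@\s+0x[0-9a-fA-F]+\s+(.+)$")

# Assumed prefixes for librdkafka symbols; adjust for your own trace.
KAFKA_PREFIXES = ("rd_kafka_", "rd_slice_", "rd_buf_")


def parse_frames(text):
    """Return the symbol names of all stack frames found in the text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line.strip())
        if match:
            frames.append(match.group(1).strip())
    return frames


def summarize(text):
    """Summarize how much of the trace sits inside librdkafka-looking code."""
    frames = parse_frames(text)
    kafka_frames = [f for f in frames if f.startswith(KAFKA_PREFIXES)]
    return {
        "total_frames": len(frames),
        "librdkafka_frames": kafka_frames,
        "has_destroy_final": "rd_kafka_broker_destroy_final" in frames,
    }


# A few lines copied from the trace in this post.
SAMPLE = """\
PC: @ 0x7f328c0e8387 __GI_raise
@ 0x7f328c0e1252 __GI___assert_fail
@ 0x4d17b4e rd_slice_narrow_copy
@ 0x4cf6e48 rd_kafka_msgset_reader_msg_v0_1
@ 0x4c4d73d rd_kafka_broker_thread_main
"""

if __name__ == "__main__":
    result = summarize(SAMPLE)
    print(result["librdkafka_frames"])
    print("destroy_final present:", result["has_destroy_final"])
```

To run it on a real log, read the file contents and pass them to `summarize`. In the trace above, the failed assertion is raised from rd_slice_narrow_copy, called while parsing a fetched message set on a broker thread, so the abort happens inside the Kafka client code path even though rd_kafka_broker_destroy_final is absent. Whether the root cause is the same as in the linked topic cannot be concluded from this check alone.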