3.2.3: all BE nodes suddenly went down

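All BE nodes suddenly went down at the same time, while the cluster's memory and CPU usage both looked normal.

This is the complete be.out log. I could not find any log entries for the query_id.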

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
3.2.3 RELEASE (build a40e2f8)
query_id:86ffbfd8-41f7-11ef-8cc8-00163e383286, fragment_instance:86ffbfd8-41f7-11ef-8cc8-00163e383287
tracker:process consumption: 68865859608
tracker:query_pool consumption: -55023504
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 2158929800
tracker:metadata consumption: -4715889408982
tracker:tablet_metadata consumption: -4721074307786
tracker:rowset_metadata consumption: 112395045
tracker:segment_metadata consumption: 807150115
tracker:column_metadata consumption: 4265353644
tracker:tablet_schema consumption: -4721160841994
tracker:segment_zonemap consumption: 97937207
tracker:short_key_index consumption: 693334376
tracker:column_zonemap_index consumption: 1590688452
tracker:ordinal_index consumption: 2380019408
tracker:bitmap_index consumption: 136400
tracker:bloom_filter_index consumption: 2088
tracker:compaction consumption: 285008328
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 43248268688
tracker:update consumption: 10650547063
tracker:chunk_allocator consumption: 2146621192
tracker:clone consumption: 0
tracker:consistency consumption: 38523456
tracker:datacache consumption: 0
tracker:replication consumption: 64001776
*** Aborted at 1720971684 (unix time) try "date -d @1720971684" if you are using GNU date ***
PC: @ 0x2c39b4d starrocks::DecimalV3Column<>::put_mysql_row_buffer()
*** SIGSEGV (@0x8) received by PID 6303 (TID 0x2b3d6a763700) from PID 8; stack trace: ***
@ 0x67283e2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x2b3d38f77cab os::Linux::chained_handler()
@ 0x2b3d38f7c59c JVM_handle_linux_signal
@ 0x2b3d38f6f8f8 signalHandler()
@ 0x2b3d3965b5d0 (unknown)
@ 0x2c39b4d starrocks::DecimalV3Column<>::put_mysql_row_buffer()
@ 0x5860704 starrocks::MysqlResultWriter::process_chunk()
@ 0x37a1e0d starrocks::pipeline::ResultSinkOperator::push_chunk()
@ 0x3842949 starrocks::pipeline::PipelineDriver::process()
@ 0x383378e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x2e493aa starrocks::ThreadPool::dispatch_thread()
@ 0x2e43e0a starrocks::thread::supervise_thread()
@ 0x2b3d39653dd5 start_thread
@ 0x2b3d3a28d02d __clone
@ 0x0 (unknown)
start time: Sun Jul 14 23:45:59 CST 2024
start time: Sun Jul 14 23:46:08 CST 2024
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
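
When reading a be.out dump like the one above, two things are quick to extract: the crash time in local time, and any memory trackers reporting negative consumption. The following is a minimal sketch, not a StarRocks tool. The helper names `negative_trackers` and `abort_time` are hypothetical, the regular expressions only assume the line shapes shown in this log, and the UTC+8 offset assumes that "CST" in the "start time" lines means China Standard Time. The thread does not establish whether the negative tracker values are related to the crash.

```python
import re
from datetime import datetime, timedelta, timezone

# Excerpt of the be.out lines quoted above (not the full log).
BE_OUT = """\
tracker:process consumption: 68865859608
tracker:query_pool consumption: -55023504
tracker:metadata consumption: -4715889408982
tracker:tablet_metadata consumption: -4721074307786
tracker:tablet_schema consumption: -4721160841994
tracker:page_cache consumption: 43248268688
*** Aborted at 1720971684 (unix time) try "date -d @1720971684" if you are using GNU date ***
"""

# Assumed line shapes, taken only from the log in this thread.
TRACKER_RE = re.compile(r"^tracker:(\S+) consumption: (-?\d+)$", re.MULTILINE)
ABORT_RE = re.compile(r"Aborted at (\d+) \(unix time\)")


def negative_trackers(log_text):
    """Return {tracker_name: bytes} for trackers whose consumption is below zero."""
    return {
        name: int(value)
        for name, value in TRACKER_RE.findall(log_text)
        if int(value) < 0
    }


def abort_time(log_text, utc_offset_hours=8):
    """Convert the 'Aborted at <unix time>' stamp to a timezone-aware datetime.

    utc_offset_hours=8 is an assumption (China Standard Time).
    """
    match = ABORT_RE.search(log_text)
    if match is None:
        return None
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(int(match.group(1)), tz=tz)


if __name__ == "__main__":
    print(abort_time(BE_OUT).isoformat())
    for name, value in negative_trackers(BE_OUT).items():
        print(name, value)
```

Under the UTC+8 assumption the abort stamp falls a few minutes before the first "start time" line in the log, which is consistent with the BE being restarted shortly after the SIGSEGV.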

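This morning we ran into a strict SQL mode validation problem. After running set GLOBAL sql_mode='', queries started failing with: Vectorized engine does not support the operator, node_type: 0 backend [id=15088] [host=xx]

Restarting the affected BE node did not help.
Only after restarting all BE and FE nodes did the cluster return to normal.
What is the cause of this?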

Could you try set global cbo_push_down_distinct_below_window=false ?

Or upgrade to 3.2.8+.

That parameter is already false.

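Confirmed: the problem is caused by using the migration tool to migrate between different versions. It is already fixed in 3.2.8. As a workaround, you need to drop this table and set

set global enable_legacy_compatibility_for_replication=true

then run the migration again.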
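
To decide whether a given BE is still on a build older than the fix mentioned above, the version banner printed in be.out (for example "3.2.3 RELEASE (build a40e2f8)") can be compared against 3.2.8. This is only a sketch: `parse_version` and `needs_workaround` are hypothetical helper names, and the comparison is a plain tuple comparison. The thread only states that 3.2.8 contains the fix, so results for other release lines (such as 3.1.x or 3.3.x) should not be relied on.

```python
import re

# Version the thread says contains the fix.
FIXED_IN = (3, 2, 8)


def parse_version(banner):
    """Extract (major, minor, patch) from a banner such as '3.2.3 RELEASE (build a40e2f8)'."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no version found in {banner!r}")
    return tuple(int(part) for part in match.groups())


def needs_workaround(banner):
    """True if the banner's version is older than FIXED_IN (simple tuple comparison)."""
    return parse_version(banner) < FIXED_IN


if __name__ == "__main__":
    print(needs_workaround("3.2.3 RELEASE (build a40e2f8)"))
```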