BE crash on 3.2.8 (shared-nothing)

Version 3.2.8, shared-nothing architecture, 3 FE, 6 BE.
There is a query_id, but I cannot find the corresponding SQL.
be.out output:

tracker:process consumption: 115392331320
tracker:query_pool consumption: 9507027568
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 1420725835
tracker:tablet_metadata consumption: 210271161
tracker:rowset_metadata consumption: 106819702
tracker:segment_metadata consumption: 100677053
tracker:column_metadata consumption: 1002957919
tracker:tablet_schema consumption: 8537673
tracker:segment_zonemap consumption: 34496999
tracker:short_key_index consumption: 55909279
tracker:column_zonemap_index consumption: 187255199
tracker:ordinal_index consumption: 595358008
tracker:bitmap_index consumption: 785952
tracker:bloom_filter_index consumption: 90064
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 92675550160
tracker:update consumption: 4987002921
tracker:chunk_allocator consumption: 2148693472
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1718853165 (unix time) try "date -d @1718853165" if you are using GNU date ***
PC: @          0x4fcb438 starrocks::serde::(anonymous namespace)::read_raw()
*** SIGSEGV (@0x0) received by PID 49583 (TID 0x7f7b8404d700) from PID 0; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f7d53d1005b os::Linux::chained_handler()
    @     0x7f7d53d1512b JVM_handle_linux_signal
    @     0x7f7d53d088c8 signalHandler()
    @     0x7f7d52ea35d0 (unknown)
    @          0x4fcb438 starrocks::serde::(anonymous namespace)::read_raw()
    @          0x4fd1187 starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba8b1c starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7f7d52e9bdd5 start_thread
    @     0x7f7d5229dead __clone
    @                0x0 (unknown)
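
For anyone reading dumps like the one above, here is a minimal Python sketch that turns the `Aborted at ... (unix time)` line into a wall-clock time and converts the `tracker:` byte counts to GiB. The function name `summarize_crash_header` is made up for illustration, and the UTC+8 offset is an assumption based on the `CST` and `+08:00` timestamps that appear elsewhere in this thread.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the cluster logs in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

def summarize_crash_header(text):
    """Return (crash_time, {tracker_name: GiB}) parsed from a be.out crash dump."""
    trackers = {}
    for name, value in re.findall(r"tracker:(\S+) consumption: (\d+)", text):
        trackers[name] = int(value) / 2**30
    m = re.search(r"\*\*\* Aborted at (\d+) \(unix time\)", text)
    crash_time = datetime.fromtimestamp(int(m.group(1)), CST) if m else None
    return crash_time, trackers

# A few lines copied from the first dump in this thread.
sample = """tracker:process consumption: 115392331320
tracker:query_pool consumption: 9507027568
tracker:page_cache consumption: 92675550160
*** Aborted at 1718853165 (unix time) try "date -d @1718853165" if you are using GNU date ***"""

when, mem = summarize_crash_header(sample)
print(when.isoformat())
# Largest consumers first.
for name, gib in sorted(mem.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {gib:8.2f} GiB")
```

On the first dump this puts page_cache at roughly 86 GiB out of a process total of roughly 107 GiB, which makes the numbers easier to compare across crashes.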

OK, we will follow up on this.

:joy: The 6 BE nodes are now going down at random, and the error message is different each time. Is there a defect in 3.2.8?

3.2.8 RELEASE (build 759cc78)
query_id:aec97081-36f1-11ef-97ba-347379ae3e49, fragment_instance:aec97081-36f1-11ef-97ba-347379ae3e4b
tracker:process consumption: 109207104472
tracker:query_pool consumption: 8995526728
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 1458169397
tracker:tablet_metadata consumption: 211570292
tracker:rowset_metadata consumption: 74345750
tracker:segment_metadata consumption: 121462981
tracker:column_metadata consumption: 1050790374
tracker:tablet_schema consumption: 8727356
tracker:segment_zonemap consumption: 37034019
tracker:short_key_index consumption: 73781990
tracker:column_zonemap_index consumption: 212171542
tracker:ordinal_index consumption: 614978320
tracker:bitmap_index consumption: 930992
tracker:bloom_filter_index consumption: 108560
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 87818387888
tracker:update consumption: 3463229693
tracker:chunk_allocator consumption: 2019648928
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1719759712 (unix time) try "date -d @1719759712" if you are using GNU date ***
PC: @          0x2c3512c starrocks::BinaryColumnBase<>::append()
*** SIGSEGV (@0x200000072) received by PID 144648 (TID 0x7fd22f11a700) from PID 114; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fd3ef8cfe9b os::Linux::chained_handler()
    @     0x7fd3ef8d4a5d JVM_handle_linux_signal
    @     0x7fd3ef8c7858 signalHandler()
    @     0x7fd3eedaf5d0 (unknown)
    @          0x2c3512c starrocks::BinaryColumnBase<>::append()
    @          0x345dc9e starrocks::NullableColumn::append()
    @          0x3545833 starrocks::JoinHashTable::append_chunk()
    @          0x3a85d3c starrocks::HashJoinBuilder::append_chunk()
    @          0x3a7feec starrocks::HashJoiner::append_chunk_to_ht()
    @          0x38b7a09 starrocks::pipeline::HashJoinBuildOperator::push_chunk()
    @          0x3868b53 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7fd3eeda7dd5 start_thread
    @     0x7fd3ee1a9ead __clone
    @                0x0 (unknown)
start time: Sun Jun 30 23:02:16 CST 2024, server uptime:  23:02:16 up 16 days,  6:39,  0 users,  load average: 6.22, 7.25, 7.69
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data1/starrocks/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data1/starrocks/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
3.2.8 RELEASE (build 759cc78)
query_id:87d51b84-3756-11ef-839e-347379a56d0c, fragment_instance:87d51b84-3756-11ef-839e-347379a56d0e
tracker:process consumption: 166374386796
tracker:query_pool consumption: 64621517552
tracker:query_pool/connector_scan consumption: 27590656
tracker:load consumption: 55546732
tracker:metadata consumption: 1389981689
tracker:tablet_metadata consumption: 212628807
tracker:rowset_metadata consumption: 77893673
tracker:segment_metadata consumption: 108985578
tracker:column_metadata consumption: 990473631
tracker:tablet_schema consumption: 8755767
tracker:segment_zonemap consumption: 35166352
tracker:short_key_index consumption: 63755717
tracker:column_zonemap_index consumption: 175155295
tracker:ordinal_index consumption: 600089896
tracker:bitmap_index consumption: 649104
tracker:bloom_filter_index consumption: 92896
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 87809331760
tracker:update consumption: 4491298709
tracker:chunk_allocator consumption: 1884074104
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1719803027 (unix time) try "date -d @1719803027" if you are using GNU date ***
PC: @          0x6f8c511 svb_decode_avx_simple
*** SIGSEGV (@0x7fe48fbf3000) received by PID 198479 (TID 0x7ff936a2f700) from PID 18446744071826255872; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7ffac4e80e9b os::Linux::chained_handler()
    @     0x7ffac4e85a5d JVM_handle_linux_signal
    @     0x7ffac4e78858 signalHandler()
    @     0x7ffac43605d0 (unknown)
    @          0x6f8c511 svb_decode_avx_simple
    @          0x6f8c861 streamvbyte_decode
    @          0x4fcbba0 starrocks::serde::(anonymous namespace)::decode_integers<>()
    @          0x4fcfa17 starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba83bc starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7ffac4358dd5 start_thread
    @     0x7ffac375aead __clone
    @                0x0 (unknown)
start time: Mon Jul  1 11:04:17 CST 2024, server uptime:  11:04:17 up 16 days, 18:41,  0 users,  load average: 40.77, 53.21, 69.07
3.2.8 RELEASE (build 759cc78)
query_id:abf993d2-3756-11ef-ba19-347379ae3e3b, fragment_instance:abf993d2-3756-11ef-ba19-347379ae3e3d
tracker:process consumption: 60526278504
tracker:query_pool consumption: 30042518432
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 320873171
tracker:tablet_metadata consumption: 212571451
tracker:rowset_metadata consumption: 76575732
tracker:segment_metadata consumption: 4041820
tracker:column_metadata consumption: 27684168
tracker:tablet_schema consumption: 8749027
tracker:segment_zonemap consumption: 1954966
tracker:short_key_index consumption: 1623896
tracker:column_zonemap_index consumption: 6837336
tracker:ordinal_index consumption: 13302976
tracker:bitmap_index consumption: 101632
tracker:bloom_filter_index consumption: 35264
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 22351127760
tracker:update consumption: 3905731243
tracker:chunk_allocator consumption: 143986480
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1719803087 (unix time) try "date -d @1719803087" if you are using GNU date ***
PC: @          0x6f8c49f svb_decode_avx_simple
*** SIGSEGV (@0x7f97a8c00000) received by PID 18957 (TID 0x7f9c35e7d700) from PID 18446744072245739520; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f9dc1e8c5d0 (unknown)
    @          0x6f8c49f svb_decode_avx_simple
    @          0x6f8c861 streamvbyte_decode
    @          0x4fcbba0 starrocks::serde::(anonymous namespace)::decode_integers<>()
    @          0x4fcfa17 starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba83bc starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7f9dc1e84dd5 start_thread
    @     0x7f9dc1286ead __clone
    @                0x0 (unknown)
start time: Mon Jul  1 11:05:01 CST 2024, server uptime:  11:05:01 up 16 days, 18:41,  0 users,  load average: 46.62, 53.07, 68.34
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data1/starrocks/starrocks/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data1/starrocks/starrocks/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

I found the query_id for that point in time in be.WARNING. Does this mean the network connection dropped?

W0701 11:03:39.983711   520 runtime_filter_worker.cpp:314] brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=400ms @10.133.58.79:8060
W0701 11:03:46.055248 199232 mem_hook.cpp:249] large memory alloc, query_id:87d51b84-3756-11ef-839e-347379a56d0c instance: 87d51b84-3756-11ef-839e-347379a56d0e acquire:1681221220 bytes, stack:
    @          0x2db8fad  malloc
    @          0x8b80ae5  operator new()
    @          0x4fcfa3b  starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba83bc  starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8  starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d  starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c  starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8  starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a  starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660  starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d  starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593  starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa  starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146  starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e  starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c  starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca  starrocks::Thread::supervise_thread()
    @     0x7ffac4358dd5  start_thread
    @     0x7ffac375aead  __clone
    @              (nil)  (unknown)
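
To line up be.WARNING with be.out, one option is to pull the query_id and allocation size out of every `large memory alloc` warning and compare them with the query_id printed at the top of the crash dump. The sketch below assumes the line layout shown in the excerpt above (as written by mem_hook.cpp in this build); `large_allocs` is a hypothetical helper name.

```python
import re

# Assumption: the warning layout matches the be.WARNING excerpt in this thread.
ALLOC_RE = re.compile(
    r"large memory alloc, query_id:(?P<query_id>[0-9a-f-]+) "
    r"instance: (?P<instance>[0-9a-f-]+) acquire:(?P<bytes>\d+) bytes"
)

def large_allocs(lines):
    """Yield (query_id, instance_id, bytes) for each large-allocation warning."""
    for line in lines:
        m = ALLOC_RE.search(line)
        if m:
            yield m["query_id"], m["instance"], int(m["bytes"])

sample = [
    "W0701 11:03:39.983711   520 runtime_filter_worker.cpp:314] brpc failed, "
    "error=RPC call is timed out",
    "W0701 11:03:46.055248 199232 mem_hook.cpp:249] large memory alloc, "
    "query_id:87d51b84-3756-11ef-839e-347379a56d0c "
    "instance: 87d51b84-3756-11ef-839e-347379a56d0e acquire:1681221220 bytes, stack:",
]

hits = list(large_allocs(sample))
for query_id, instance, nbytes in hits:
    print(f"{query_id} {nbytes / 2**30:.2f} GiB")
```

In this excerpt the allocation (about 1.57 GiB, at 11:03:46) carries the same query_id as the crash dump stamped 1719803027, which is 11:03:47 at UTC+8, and both stacks pass through `ColumnArraySerde::deserialize()`.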

Another one went down just now :sob:
be.WARNING

E0702 09:44:08.188309 71781 scan_operator.cpp:422] scan fragment 9277f120-3814-11ef-839e-347379a56d11 driver 29 Scan tasks error: Cancelled: canceled state
W0702 09:44:28.165135 70839 mem_hook.cpp:249] large memory alloc, query_id:9e5ea9fd-3814-11ef-839e-347379a56d0c instance: 9e5ea9fd-3814-11ef-839e-347379a56d0e acquire:2424890960 bytes, stack:
    @          0x2db8fad  malloc
    @          0x8b80ae5  operator new()
    @          0x4fcfa3b  starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba83bc  starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8  starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d  starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c  starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8  starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a  starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660  starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d  starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593  starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa  starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146  starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e  starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c  starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca  starrocks::Thread::supervise_thread()
    @     0x7fc4cef2bdd5  start_thread
    @     0x7fc4ce32dead  __clone
    @              (nil)  (unknown)

be.out:

3.2.8 RELEASE (build 759cc78)
query_id:9e5ea9fd-3814-11ef-839e-347379a56d0c, fragment_instance:9e5ea9fd-3814-11ef-839e-347379a56d0e
tracker:process consumption: 119513104416
tracker:query_pool consumption: 15172436904
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 1415716345
tracker:tablet_metadata consumption: 212070182
tracker:rowset_metadata consumption: 116910991
tracker:segment_metadata consumption: 99270196
tracker:column_metadata consumption: 987464976
tracker:tablet_schema consumption: 8841470
tracker:segment_zonemap consumption: 35877730
tracker:short_key_index consumption: 53014895
tracker:column_zonemap_index consumption: 179814472
tracker:ordinal_index consumption: 588150768
tracker:bitmap_index consumption: 808256
tracker:bloom_filter_index consumption: 84016
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 92684544064
tracker:update consumption: 4072508566
tracker:chunk_allocator consumption: 2110698264
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1719884669 (unix time) try "date -d @1719884669" if you are using GNU date ***
PC: @          0x6f8c536 svb_decode_avx_simple
*** SIGSEGV (@0x7fbc89e99000) received by PID 70130 (TID 0x7fc34e95d700) from PID 18446744071728369664; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fc4cef335d0 (unknown)
    @          0x6f8c536 svb_decode_avx_simple
    @          0x6f8c861 streamvbyte_decode
    @          0x4fcbba0 starrocks::serde::(anonymous namespace)::decode_integers<>()
    @          0x4fcfa17 starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x2ba83bc starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd0b6d starrocks::ColumnVisitorMutableAdapter<>::visit()
    @          0x345e00c starrocks::ColumnFactory<>::accept_mutable()
    @          0x4fd08f8 starrocks::serde::ColumnArraySerde::deserialize()
    @          0x4fd380a starrocks::serde::ProtobufChunkDeserializer::deserialize()
    @          0x2da5660 starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
    @          0x2daa08d starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
    @          0x2d7c593 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
    @          0x37a44aa starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
    @          0x3868146 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7fc4cef2bdd5 start_thread
    @     0x7fc4ce32dead __clone
    @                0x0 (unknown)
start time: Tue Jul  2 09:45:01 CST 2024, server uptime:  09:45:01 up 17 days, 17:06,  0 users,  load average: 7.55, 9.94, 11.23

Has it happened again recently? Try set global low_cardinality_optimize_v2=false;

It still happens occasionally.

Can you find the SQL for aec97081-36f1-11ef-97ba-347379ae3e49 and post its explain costs output?

Does it use grouping sets?

Do you mean resource groups? Those are enabled.
The log record for the SQL aec97081-36f1-11ef-97ba-347379ae3e49 has already been rotated out.

grouping sets is a piece of SQL syntax.

I cannot find the SQL for now. Also, it crashed again just now, and this time the error message is different yet again.

3.2.8 RELEASE (build 759cc78)
query_id:546f2436-4a25-11ef-a401-347379ae3e49, fragment_instance:546f2436-4a25-11ef-a401-347379ae3e4c
tracker:process consumption: 144371095264
tracker:query_pool consumption: 35080363400
tracker:query_pool/connector_scan consumption: 17965056
tracker:load consumption: 0
tracker:metadata consumption: 1575313274
tracker:tablet_metadata consumption: 214977665
tracker:rowset_metadata consumption: 157301216
tracker:segment_metadata consumption: 111023660
tracker:column_metadata consumption: 1092010733
tracker:tablet_schema consumption: 10428833
tracker:segment_zonemap consumption: 44459652
tracker:short_key_index consumption: 54180895
tracker:column_zonemap_index consumption: 208032381
tracker:ordinal_index consumption: 633261256
tracker:bitmap_index consumption: 1074880
tracker:bloom_filter_index consumption: 94240
tracker:compaction consumption: 39948936
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 92690016080
tracker:update consumption: 5778455378
tracker:chunk_allocator consumption: 1948126040
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1721870966 (unix time) try "date -d @1721870966" if you are using GNU date ***
PC: @          0x2c3cfd8 starrocks::BinaryColumnBase<>::append_selective()
*** SIGSEGV (@0x100000006) received by PID 167194 (TID 0x7f8171b25700) from PID 6; stack trace: ***
    @          0x67c6d22 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f832aa0ce9b os::Linux::chained_handler()
    @     0x7f832aa11a5d JVM_handle_linux_signal
    @     0x7f832aa04858 signalHandler()
    @     0x7f8329eec5d0 (unknown)
    @          0x2c3cfd8 starrocks::BinaryColumnBase<>::append_selective()
    @          0x345d9dd starrocks::NullableColumn::append_selective()
    @          0x344219a starrocks::Chunk::append_selective()
    @          0x3aee96b starrocks::pipeline::ExchangeSinkOperator::Channel::add_rows_selective()
    @          0x3aefe75 starrocks::pipeline::ExchangeSinkOperator::push_chunk()
    @          0x3868b53 starrocks::pipeline::PipelineDriver::process()
    @          0x385a70e starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
    @          0x2e7d84c starrocks::ThreadPool::dispatch_thread()
    @          0x2e774ca starrocks::Thread::supervise_thread()
    @     0x7f8329ee4dd5 start_thread
    @     0x7f83292e6ead __clone
    @                0x0 (unknown)
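
Since the crash site differs from dump to dump, it can help to tally the `PC:` lines across all be.out dumps and see which functions recur. This is a rough sketch; `crash_sites` is a hypothetical helper, and the sample simply repeats the six `PC:` lines posted in this thread.

```python
import re
from collections import Counter

# Matches the "PC: @ <address> <function>" line that glog prints for each crash.
PC_RE = re.compile(r"^PC: @\s+0x[0-9a-f]+ (.+)$", re.MULTILINE)

def crash_sites(be_out_text):
    """Count how often each function appears as the faulting PC."""
    return Counter(PC_RE.findall(be_out_text))

sample = """PC: @          0x4fcb438 starrocks::serde::(anonymous namespace)::read_raw()
PC: @          0x2c3512c starrocks::BinaryColumnBase<>::append()
PC: @          0x6f8c511 svb_decode_avx_simple
PC: @          0x6f8c49f svb_decode_avx_simple
PC: @          0x6f8c536 svb_decode_avx_simple
PC: @          0x2c3cfd8 starrocks::BinaryColumnBase<>::append_selective()
"""

sites = crash_sites(sample)
for func, count in sites.most_common():
    print(count, func)
```

For the dumps posted so far, three of the six crashes fault in `svb_decode_avx_simple`; the others fault in `read_raw()`, `BinaryColumnBase<>::append()` and `BinaryColumnBase<>::append_selective()`.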

Can you reproduce this crash by replaying the query?

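I have not replayed it. This is a production environment shared by several business modules, and without the SQL it is hard to tell which module triggers the crash :joy: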

Doesn't this crash come with a query id?

Is it in neither fe.log nor fe.audit.log?

I searched fe.audit.log on all three FEs and found nothing :joy:

Search fe.log and fe.warn.log as well. The query_id may have changed because of a retry, or the SQL may be a statistics-collection query.
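
A minimal sketch of that search, to be run on each FE: it scans fe.log, fe.warn.log and fe.audit.log (plus any rotated files with the same prefix) for a query_id. The helper name `find_query_id` and the file-name patterns are assumptions about a typical FE log directory, and compressed rotated logs would need to be decompressed first. The demo writes throwaway files so the block runs anywhere.

```python
import tempfile
from pathlib import Path

def find_query_id(log_dir, query_id,
                  patterns=("fe.log*", "fe.warn.log*", "fe.audit.log*")):
    """Yield (file name, line number, line) for every log line mentioning query_id."""
    for pattern in patterns:
        for path in sorted(Path(log_dir).glob(pattern)):
            if not path.is_file():
                continue
            # errors="replace" keeps the scan going if a file has odd bytes.
            with path.open(errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if query_id in line:
                        yield path.name, lineno, line.rstrip("\n")

# Demo on a temporary directory; on a real FE, pass its log directory instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fe.warn.log").write_text(
        "WARN get next fail, need cancel. "
        "query id: 546f2436-4a25-11ef-a401-347379ae3e49\n"
    )
    Path(tmp, "fe.audit.log").write_text("no match here\n")
    hits = list(find_query_id(tmp, "546f2436-4a25-11ef-a401-347379ae3e49"))
    for name, lineno, line in hits:
        print(f"{name}:{lineno}: {line}")
```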

Found it in fe.warn.log :joy:

2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36695|1238994) [FragmentInstanceExecState.waitForDeploymentCompletion():273] catch a execute exception
java.util.concurrent.ExecutionException: A error occurred: errorCode=2001 errorMessage:Connection reset by peer
        at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:578) ~[jprotobuf-rpc-core-4.2.1.jar:?]
        at com.starrocks.qe.scheduler.dag.FragmentInstanceExecState.waitForDeploymentCompletion(FragmentInstanceExecState.java:267) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.scheduler.Deployer.waitForDeploymentCompletion(Deployer.java:225) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.scheduler.Deployer.deployFragments(Deployer.java:116) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.DefaultCoordinator.deliverExecFragments(DefaultCoordinator.java:581) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.DefaultCoordinator.startScheduling(DefaultCoordinator.java:494) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.scheduler.Coordinator.startScheduling(Coordinator.java:102) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.scheduler.Coordinator.exec(Coordinator.java:85) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1084) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:606) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:415) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:610) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:917) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: com.baidu.jprotobuf.pbrpc.ErrorDataException: A error occurred: errorCode=2001 errorMessage:Connection reset by peer
        at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.doWaitCallback(ProtobufRpcProxy.java:651) ~[jprotobuf-rpc-core-4.2.1.jar:?]
        at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy.access$0(ProtobufRpcProxy.java:611) ~[jprotobuf-rpc-core-4.2.1.jar:?]
        at com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy$2.get(ProtobufRpcProxy.java:576) ~[jprotobuf-rpc-core-4.2.1.jar:?]
        ... 16 more
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36706|1239057) [FragmentInstanceExecState.waitForDeploymentCompletion():297] exec plan fragment failed, errmsg=exec rpc error. backend [id=122861700] [host=10.133.58.205], code=THRIFT_RPC_ERROR, fragmentId=F00, backend=10.133.58.205:9060
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36689|1238968) [FragmentInstanceExecState.waitForDeploymentCompletion():297] exec plan fragment failed, errmsg=exec rpc error. backend [id=122861700] [host=10.133.58.205], code=THRIFT_RPC_ERROR, fragmentId=F00, backend=10.133.58.205:9060
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36703|1239031) [SimpleScheduler.addToBlacklist():198] add black list 122861700
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36695|1238994) [FragmentInstanceExecState.waitForDeploymentCompletion():297] exec plan fragment failed, errmsg=exec rpc error. backend [id=122861700] [host=10.133.58.205], code=THRIFT_RPC_ERROR, fragmentId=F00, backend=10.133.58.205:9060
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36703|1239031) [DefaultCoordinator.getNext():756] get next fail, need cancel. status errorCode THRIFT_RPC_ERROR A error occurred: errorCode=2001 errorMessage:Connection reset by peer, query id: 546f2436-4a25-11ef-a401-347379ae3e49
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36706|1239057) [SimpleScheduler.addToBlacklist():198] add black list 122861700
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36703|1239031) [DefaultCoordinator.updateStatus():731] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: 546f2436-4a25-11ef-a401-347379ae3e49, instance id: NaN
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36706|1239057) [StmtExecutor.execute():621] retry 1 times. stmt: SELECT CONNECTION_ID()
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36689|1238968) [SimpleScheduler.addToBlacklist():198] add black list 122861700
2024-07-25 09:29:45.975+08:00 WARN (starrocks-mysql-nio-pool-36706|1239057) [StmtExecutor.execute():644] Query 5bb8fd0d-4a25-11ef-a401-347379ae3e49 failed. Planner profile : Planner:
   - -- Total[1] 0
   -     -- Analyzer[1] 0
   -     -- Transformer[1] 0
   -     -- Optimizer[1] 0
   -         -- preprocessMvs[1] 0
   -             -- chooseCandidates[1] 0
   -             -- generateMvPlan[1] 0
   -             -- validateMv[1] 0
   -             -- mvWithView[1] 0
   -         -- RuleBaseOptimize[1] 0
   -         -- CostBaseOptimize[1] 0
   -         -- PhysicalRewrite[1] 0
   -         -- PlanValidate[1] 0
   -             -- InputDependenciesChecker[1] 0
   -             -- TypeChecker[1] 0
   -             -- CTEUniqueChecker[1] 0
   -             -- ColumnReuseChecker[1] 0
   -     -- ExecPlanBuild[1] 0
   - -- Pending[1] 0
   - -- Prepare[1] 0
   - -- Deploy[1] 7s487ms
   -     -- DeployLockInternalTime[1] 7s487ms
   -         -- DeploySerializeConcurrencyTime[1] 0
   -         -- DeployStageByStageTime[2] 0
   -         -- DeployWaitTime[2] 7s487ms
   -             -- DeployAsyncSendTime[1] 0
   - DeployDataSize: 1617
  Reason:

Resolved by replacing the BE with a newer version.