CN隔一段时间就滚动重启,任务异常

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】starrocks cn隔一段时间就重启
【背景】
k8s部署starrocks,3fe + 3cn 存算分离 腾讯的eks(k8s) + cos对象存储 + cfs挂载日志和cn缓存
3.1.7版本,后来升级到3.1.11 就开始反馈有cn间断重启了,貌似不升级的时候是没问题的。不知道是不是升级的时候导致的问题

【业务影响】数仓任务通过jdbc执行insert任务 任务异常

到处找出来的一些日志

任务调度异常日志:
[ERROR] 2024-06-17 08:48:20.630 - [taskAppId=TASK-1144-589035-3175704]:[254] - task scheduler failure
java.lang.RuntimeException: Runtime error: encode size does not equal when decoding, encode size = 4294967296, but decode get size = 1265, raw size = 2855.

[ERROR] 2024-06-17 08:37:14.666 - [taskAppId=TASK-1142-589033-3175558]:[372] - Failed to get primary backend. shard id: 2648066
java.sql.SQLSyntaxErrorException: Failed to get primary backend. shard id: 2648066

[ERROR] 2024-06-17 08:37:19.086 - [taskAppId=TASK-1144-589035-3175555]:[254] - task scheduler failure
java.lang.RuntimeException: rpc failed, host: starrocks-cn-2.starrocks-cn-search.starrocks.svc.cluster.local

有好几个sql task都会出现上面的异常,绝大部分任务都正常。
不清楚原因到底是什么 哪个异常导致的哪个异常。

有个sql经常发生异常,拷贝出来,手动执行,貌似又没有问题。
insert overwrite db_elderly_care_ads.ads_clique_operation_client_dap partition(part_date=‘2024’) – 客户
with
ac_client as
( select t.org_id , 1 public_client_cmt – 公海客户
, case when charger_id != ‘’ or sharer_id != ‘’ then 1 else 0 end client_cmt – 客户数
, case when user_id != ‘’ then 1 else 0 end user_cmt – 用户数
, null order_cmt , null finish_order_cmt from db_elderly_care_ods.ods_tb_client_info_dap t where t.delete_status = 0 and date(t.create_time) < date_trunc(‘year’,add_months(‘2024-06-13’, 12)) – 时间 – 时间区级
) , – 订单
ac_order as
( select t.org_id , 0 public_client_cmt – 公海客户
, 0 client_cmt – 客户数
, 0 user_cmt – 用户数
, t.client_id order_cmt – 订单数
, case when o.order_status = 5 then t.client_id else null end finish_order_cmt – 完成订单数
from db_elderly_care_ods.ods_tb_client_info_dap t left join db_elderly_care_ods.ods_tb_order_dap as o on t.user_id = o.customer_id where t.delete_status = 0 and o.delete_status = 0 and o.order_type = 0 and date(t.create_time) < date_trunc(‘year’,add_months(‘2024-06-13’, 12)) – 时间 – 时间区级
) , – 数据汇总 + 关联维度
se_vi as
( select one_level , two_level , three_level , four_level , five_level , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt from ( select * from ac_client union all select * from ac_order ) s join db_elderly_care_dim.dim_org_level_dap e on s.org_id = e.org_id and org_type_id in (3,4, 5, 6, 7, 8, 9) – 长者跟养老机构挂钩,与集团本身无关
) ,
ac_re as
( select one_level , two_level , three_level , four_level , five_level , sum(public_client_cmt) public_client_cmt , sum(client_cmt) client_cmt , sum(user_cmt) user_cmt , count(distinct order_cmt) order_cmt , count(distinct finish_order_cmt) finish_order_cmt from se_vi group by grouping sets ( (one_level) , (two_level) , (three_level) , (four_level) , (five_level) ) ),
ac_bb as
( select case when one_level != ‘’ then one_level when two_level != ‘’ then two_level when three_level != ‘’ then three_level when four_level != ‘’ then four_level when five_level != ‘’ then five_level end org_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt from ac_re
),
ac_dd as
( select org_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt , ifnull(client_cmt / public_client_cmt, 0.00) client_cmt_rate , ifnull(user_cmt / client_cmt, 0.00) user_cmt_rate , ifnull(order_cmt / user_cmt, 0.00) order_cmt_rate , ifnull(finish_order_cmt / order_cmt, 0.00) finish_order_cmt_rate from ac_bb where org_id != ‘null’ ),
ac_cc as
( select b.org_id , o.org_name , o.tenant_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt , client_cmt_rate , user_cmt_rate , order_cmt_rate , finish_order_cmt_rate , substr(current_timestamp(),1, 19) , ‘y’ , substring(‘2024-06-13’, 1, 4) from ac_dd b join db_elderly_care_ods.ods_tb_org_dap o on b.org_id = o.org_id
)
select * from ac_cc

cn.out日志:
start time: Mon Jun 17 08:37:32 CST 2024
3.1.11 RELEASE (build 34f131b)
query_id:d286b8c1-2c43-11ef-a1b2-525400df7046, fragment_instance:d286b8c1-2c43-11ef-a1b2-525400df7060
tracker:process consumption: 4145881744
tracker:query_pool consumption: 712584224
tracker:load consumption: 0
tracker:metadata consumption: 20034043
tracker:tablet_metadata consumption: 162087
tracker:rowset_metadata consumption: 0
tracker:segment_metadata consumption: 3129185
tracker:column_metadata consumption: 16742771
tracker:tablet_schema consumption: 162087
tracker:segment_zonemap consumption: 1950236
tracker:short_key_index consumption: 238325
tracker:column_zonemap_index consumption: 4199267
tracker:ordinal_index consumption: 4431752
tracker:bitmap_index consumption: 2400
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 2721436736
tracker:update consumption: 34080
tracker:chunk_allocator consumption: 320995192
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1718585528 (unix time) try “date -d @1718585528” if you are using GNU date ***
PC: @ 0x85579c5 svb_decode_avx_simple
*** SIGSEGV (@0x7f60539d4000) received by PID 27 (TID 0x7f60c26c4640) from PID 1402814464; stack trace: ***
@ 0x7cfae6a google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f6127cd1520 (unknown)
@ 0x85579c5 svb_decode_avx_simple
@ 0x8557d21 streamvbyte_decode
@ 0x55619ff starrocks::serde::(anonymous namespace)::decode_integers<>()
@ 0x5563a5b starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d1001d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566ccf starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d638dd starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566c83 starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d6561d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x65620ba starrocks::serde::ProtobufChunkDeserializer::deserialize()
@ 0x662074b starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
@ 0x66249c7 starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
@ 0x66179d6 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
@ 0x60a9493 starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
@ 0x5c7e4d0 starrocks::pipeline::PipelineDriver::process()
@ 0x6508830 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x68ac92c starrocks::ThreadPool::dispatch_thread()
@ 0x68a5d6a starrocks::thread::supervise_thread()
@ 0x7f6127d23ac3 (unknown)
@ 0x7f6127db5850 (unknown)
@ 0x0 (unknown)
start time: Mon Jun 17 08:52:35 CST 2024

cn.INFO.xxx日志:
I0617 08:52:08.569945 365 pipeline_driver_executor.cpp:326] [Driver] Fail to report exec state due to query not found: fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df7048
I0617 08:52:08.574666 368 tablet_sink.cpp:2008] Olap table sink statistics. load_id: d286b8c1-2c43-11ef-a1b2-525400df7046, txn_id: 9853481, add chunk time(ms)/wait lock time(ms)/num: {2649730:(0)(0)(1)} {2612629:(0)(0)(1)} {2612632:(0)(0)(1)}
I0617 08:52:08.576160 365 pipeline_driver_executor.cpp:326] [Driver] Fail to report exec state due to query not found: fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df705d
I0617 08:52:08.576581 365 pipeline_driver_executor.cpp:326] [Driver] Fail to report exec state due to query not found: fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df7059
W0617 08:52:08.586138 369 stack_util.cpp:350] 2024-06-17 08:52:08.586106, query_id=d286b8c1-2c43-11ef-a1b2-525400df7046, fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df7060 throws exception: std::runtime_error, trace:
W0617 08:52:08.586223 369 pipeline_driver.cpp:308] pull_chunk returns not ok status Runtime error: Runtime error: encode size does not equal when decoding, encode size = 4294973428, but decode get size = 1989, raw size = 6120.
W0617 08:52:08.586256 369 pipeline_driver_executor.cpp:166] [Driver] Process error, query_id=d286b8c1-2c43-11ef-a1b2-525400df7046, instance_id=d286b8c1-2c43-11ef-a1b2-525400df7060, status=Runtime error: Runtime error: encode size does not equal when decoding, encode size = 4294973428, but decode get size = 1989, raw size = 6120.
W0617 08:52:08.599530 371 stack_util.cpp:350] 2024-06-17 08:52:08.599493, query_id=d286b8c1-2c43-11ef-a1b2-525400df7046, fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df7057 throws exception: std::runtime_error, trace:
W0617 08:52:08.601045 371 pipeline_driver.cpp:308] pull_chunk returns not ok status Runtime error: Runtime error: encode size does not equal when decoding, encode size = 9178344165883248640, but decode get size = 11944910, raw size = 27852800.
W0617 08:52:08.601069 371 pipeline_driver_executor.cpp:166] [Driver] Process error, query_id=d286b8c1-2c43-11ef-a1b2-525400df7046, instance_id=d286b8c1-2c43-11ef-a1b2-525400df7057, status=Runtime error: Runtime error: encode size does not equal when decoding, encode size = 9178344165883248640, but decode get size = 11944910, raw size = 27852800.
I0617 08:52:08.601572 365 pipeline_driver_executor.cpp:326] [Driver] Fail to report exec state due to query not found: fragment_instance_id=d286b8c1-2c43-11ef-a1b2-525400df7057
I0617 08:52:35.489236 27 daemon.cpp:289] version 3.1.11-34f131b

k8s cn pod日志:

有没有可能是k8s 3.1.7升级3.1.11的时候,不能保证leader最后重启导致的版本不兼容?

观察了调度工作流,发现几个任务实例调度的时候,完全调度失败,没有成功过,单独select * from table所有表,都没有异常,于是删表,重新从正式环境导数过来,再调度了一个任务,运行成功。具体效果待后续观察。

核心异常:
java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 86, column row count: 18446744073709551615, expected row count: 200. There is probably a bug here.

java.lang.RuntimeException: Runtime error: encode size does not equal when decoding, encode size = 4294967296, but decode get size = 1662, raw size = 2444.

其他连带异常:
[ERROR] 2024-06-17 12:37:27.418 - [taskAppId=TASK-1144-589901-3178891]:[372] - create partitions failed: Table creation timed out.
You can increase the timeout by increasing the config “tablet_create_timeout_second” and try again.
To increase the config “tablet_create_timeout_second” (currently 10), run the following command:

[ERROR] 2024-06-18 12:37:40.930 - [taskAppId=TASK-1144-594052-3195053]:[372] - Backend node not found. Check if any backend node is down.
java.sql.SQLSyntaxErrorException: Backend node not found. Check if any backend node is down.

java.lang.RuntimeException: [E104]Fail to read from Socket{id=340 fd=193 addr=10.10.10.202:8060:42370} (0x0x7fe369bae340): Connection reset by peer [R1][E112]Not connected to 10.10.10.202:8060 yet, server_id=340 [R2][E112]Not connected to 10.10.10.202:8060 yet, server_id=340 [R3][E112]Not connected to 10.10.10.202:8060 yet, server_id=340
at org.apache.dolphinscheduler.server.worker.task.sql.SqlTask.executeFuncAndSql(SqlTask.java:373)

不知道是什么原因导致的。

重新导数之后,还是不行,偶尔能成功,但是大概率复现

java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 86, column row count: 18446744073709551615, expected row count: 200. There is probably a bug here.

java.lang.RuntimeException: Runtime error: encode size does not equal when decoding, encode size = 4294967296, but decode get size = 1662, raw size = 2444.

  1. 根据query id 到fe.audit.log 中搜一下对应sql,测试下是否执行该sql必现 cn 重启
  2. 测试后,发一下最近 cn.out 的完整日志

昨天试过了一下,sql就是上文里面粘贴的sql,还有其他几个,搜了一下task,发现4个task都这样。mysql客户端手动执行该sql,是可以成功的。
基于task jdbc去调度执行,有的时候可以成功,有的时候不行。
cn.out 就是上面的格式内容。

这个是早上翻出来的失败数据。其中一个cn的日志。
cn 日志:
start time: Wed Jun 19 08:48:14 CST 2024
3.1.11 RELEASE (build 34f131b)
query_id:2f13746c-2dd6-11ef-893b-525400e32b8f, fragment_instance:2f13746c-2dd6-11ef-893b-525400e32ba9
tracker:process consumption: 5996113600
tracker:query_pool consumption: 3761682504
tracker:load consumption: 0
tracker:metadata consumption: 6561182
tracker:tablet_metadata consumption: 65995
tracker:rowset_metadata consumption: 0
tracker:segment_metadata consumption: 1017435
tracker:column_metadata consumption: 5477752
tracker:tablet_schema consumption: 65995
tracker:segment_zonemap consumption: 520019
tracker:short_key_index consumption: 145788
tracker:column_zonemap_index consumption: 1554712
tracker:ordinal_index consumption: 1515816
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 1870804096
tracker:update consumption: 68370
tracker:chunk_allocator consumption: 69725272
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1718758341 (unix time) try “date -d @1718758341” if you are using GNU date ***
PC: @ 0x8557ace svb_decode_avx_simple
*** SIGSEGV (@0x7fc28176c000) received by PID 27 (TID 0x7fc2f326a640) from PID 18446744071586627584; stack trace: ***
@ 0x7cfae6a google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc358cfe520 (unknown)
@ 0x8557ace svb_decode_avx_simple
@ 0x8557d21 streamvbyte_decode
@ 0x55619ff starrocks::serde::(anonymous namespace)::decode_integers<>()
@ 0x5563a5b starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d1001d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566ccf starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d638dd starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566c83 starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d6561d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x65620ba starrocks::serde::ProtobufChunkDeserializer::deserialize()
@ 0x662074b starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
@ 0x66249c7 starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
@ 0x66179d6 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
@ 0x60a9493 starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
@ 0x5c7e4d0 starrocks::pipeline::PipelineDriver::process()
@ 0x6508830 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x68ac92c starrocks::ThreadPool::dispatch_thread()
@ 0x68a5d6a starrocks::thread::supervise_thread()
@ 0x7fc358d50ac3 (unknown)
@ 0x7fc358de2850 (unknown)
@ 0x0 (unknown)

sql:

insert overwrite db_elderly_care_ads.ads_clique_operation_client_dap partition(part_date=‘2024’) – 客户
with ac_client as ( select t.org_id , 1 public_client_cmt – 公海客户
, case when charger_id != ‘’ or sharer_id != ‘’ then 1 else 0 end client_cmt – 客户数
, case when user_id != ‘’ then 1 else 0 end user_cmt – 用户数
, null order_cmt , null finish_order_cmt from db_elderly_care_ods.ods_tb_client_info_dap t where t.delete_status = 0 and date(t.create_time) < date_trunc(‘year’,add_months(‘2024-06-19’, 12)) – 时间 – 时间区级
) , – 订单
ac_order as ( select t.org_id , 0 public_client_cmt – 公海客户
, 0 client_cmt – 客户数
, 0 user_cmt – 用户数
, t.client_id order_cmt – 订单数
, case when o.order_status = 5 then t.client_id else null end finish_order_cmt – 完成订单数
from db_elderly_care_ods.ods_tb_client_info_dap t left join db_elderly_care_ods.ods_tb_order_dap as o on t.user_id = o.customer_id where t.delete_status = 0 and o.delete_status = 0 and o.order_type = 0 and date(t.create_time) < date_trunc(‘year’,add_months(‘2024-06-19’, 12)) – 时间 – 时间区级
) , – 数据汇总 + 关联维度
se_vi as ( select one_level , two_level , three_level , four_level , five_level , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt from ( select * from ac_client union all select * from ac_order ) s join db_elderly_care_dim.dim_org_level_dap e on s.org_id = e.org_id and org_type_id in (3,4, 5, 6, 7, 8, 9) – 长者跟养老机构挂钩,与集团本身无关
) , ac_re as ( select one_level , two_level , three_level , four_level , five_level , sum(public_client_cmt) public_client_cmt , sum(client_cmt) client_cmt , sum(user_cmt) user_cmt , count(distinct order_cmt) order_cmt , count(distinct finish_order_cmt) finish_order_cmt from se_vi group by grouping sets ( (one_level) , (two_level) , (three_level) , (four_level) , (five_level) ) ), ac_bb as ( select case when one_level != ‘’ then one_level when two_level != ‘’ then two_level when three_level != ‘’ then three_level when four_level != ‘’ then four_level when five_level != ‘’ then five_level end org_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt from ac_re ), ac_dd as ( select org_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt , ifnull(client_cmt / public_client_cmt, 0.00) client_cmt_rate , ifnull(user_cmt / client_cmt, 0.00) user_cmt_rate , ifnull(order_cmt / user_cmt, 0.00) order_cmt_rate , ifnull(finish_order_cmt / order_cmt, 0.00) finish_order_cmt_rate from ac_bb where org_id != ‘null’ ), ac_cc as ( select b.org_id , o.org_name , o.tenant_id , public_client_cmt , client_cmt , user_cmt , order_cmt , finish_order_cmt , client_cmt_rate , user_cmt_rate , order_cmt_rate , finish_order_cmt_rate , substr(current_timestamp(),1, 19) , ‘y’ , substring(‘2024-06-19’, 1, 4) from ac_dd b join db_elderly_care_ods.ods_tb_org_dap o on b.org_id = o.org_id ) select * from ac_cc

task jdbc异常:
java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 151, column row count: 8, expected row count: 244. There is probably a bug here.





这个是几种异常的部分源码位置。但是看不懂。

有没有可能是3.1.7升级到3.1.11的时候出现的问题?比如k8s部署的时候,不能保证一定最后重启fe的leader,顺序错位有没有影响?cn cache会不会有问题?或者有没有可能是我们task不规范,多个task操作一个表?或者什么的?因为jdbc调度,偶尔是可以成功的。偶尔又报这种异常。

回退到3.1.7,再升级到3.1.8就正常。不知道为什么3.1.11出问题

升级到3.1.11 和 3.1.12都有这个异常 稳定复现

3.1.9也有这个问题 有大佬可以分析一下吗?

能找到个稳定复现case吗,回放这个有问题的SQL

我用做了一个实验,单fe + 单cn ,物理机, 存储在腾讯的cos上。如果不开启fqdn,没有这个异常,开启fqdn,有这个异常。

发个explain costs 我看看

dim_org_level_dap333.txt0.csv (100.0 KB) 异常反馈2 (4.7 KB)
我搞了一个demo出来,你看看能不能复现?数据是简单脱敏的

你对这个SQL执行下explain costs 我先快速看一下

aaa.costs (148.7 KB)

异常环境执行:

explain costs select one_level
, two_level
, three_level
, four_level
, five_level
, count(distinct device_cmt) device_cmt
, count(distinct online_cmt) online_cmt
, count(distinct unonline_cmt) unonline_cmt
, count(distinct safe_cmt) safe_cmt
, count(distinct care_cmt) care_cmt
, count(distinct treat_cmt) treat_cmt
, count(distinct health_cmt) health_cmt
, count(distinct sport_cmt) sport_cmt
, count(distinct execise_cmt) execise_cmt
, count(distinct home_cmt) home_cmt
, count(distinct visit_cmt) visit_cmt
, count(distinct other_cmt) other_cmt
from dim_org_level_dap333
group by
grouping sets (
(one_level)
, (two_level)
, (three_level)
, (four_level)
, (five_level)
)

这个SQL是报哪个错误呢?

异常:
1. java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 86, column row count: 18446744073709551615, expected row count: 200. There is probably a bug here.

2. java.lang.RuntimeException: Runtime error: encode size does not equal when decoding, encode size = 4294967296, but decode get size = 1662, raw size = 2444.

这种异常


类似与这三个异常,随机复现