Spark SQL Load: writing data to an SR Primary Key table

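To help us locate your problem faster, please provide the following information. Thank you.
【Details】A Spark SQL job writing to an SR Primary Key table hangs.
【Background】What operations were performed?
【Business impact】
【StarRocks version】e.g. 3.1
【Cluster size】e.g. 2 FE (1 follower) + 3 BE
【Machine info】
【Table model】Primary Key model
【Import or export method】Spark SQL import
【Contact】So that we can reach you promptly for logs while working on the issue, please add your contact details, e.g. community group 4, Xiao Li, or an email address. Thank you.
【Attachments】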
I0927 19:16:56.807106 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=7f49ddad-1549-e6d3-410e-49b6bab87e92
I0927 19:16:56.807111 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 7f49ddad-1549-e6d3-410e-49b6bab87e92
I0927 19:16:56.807113 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=33457f0e-c2f8-e1f4-622a-bdd18625c1a8
I0927 19:16:56.807116 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 33457f0e-c2f8-e1f4-622a-bdd18625c1a8
I0927 19:16:56.807121 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=bc46c141-4c31-5ce7-e881-d407ee3a529f
I0927 19:16:56.807123 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment bc46c141-4c31-5ce7-e881-d407ee3a529f
I0927 19:16:56.807126 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=1e414a4e-9e07-a878-494e-448bca20fbae
I0927 19:16:56.807129 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 1e414a4e-9e07-a878-494e-448bca20fbae
I0927 19:16:56.807133 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=83441e50-998a-4f15-f2f0-b2c3f21ae6bc
I0927 19:16:56.807137 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 83441e50-998a-4f15-f2f0-b2c3f21ae6bc
I0927 19:16:56.807140 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=14444aa4-a4df-1d05-8775-dc9e612fb1a3
I0927 19:16:56.807143 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 14444aa4-a4df-1d05-8775-dc9e612fb1a3
I0927 19:16:56.807147 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=874f43ed-106f-fe03-41b4-4d070a51faa9
I0927 19:16:56.807149 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 874f43ed-106f-fe03-41b4-4d070a51faa9
I0927 19:16:56.807153 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=4e45ee09-8118-18fc-6991-586b46e9cb83
I0927 19:16:56.807157 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 4e45ee09-8118-18fc-6991-586b46e9cb83
I0927 19:16:56.807159 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=aa4b57e2-3691-8958-0cb8-7c060618f688
I0927 19:16:56.807163 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment aa4b57e2-3691-8958-0cb8-7c060618f688
I0927 19:16:56.807166 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=094c0b93-65dc-c428-6f1c-744c7d6571a1
I0927 19:16:56.807169 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 094c0b93-65dc-c428-6f1c-744c7d6571a1
I0927 19:16:56.807173 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=fd48b5ac-d13a-0000-e102-616f57feeba0
I0927 19:16:56.807175 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment fd48b5ac-d13a-0000-e102-616f57feeba0
I0927 19:16:56.807179 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=c843c7a8-547c-2969-4ca2-716a3af46cad
I0927 19:16:56.807183 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment c843c7a8-547c-2969-4ca2-716a3af46cad
I0927 19:16:56.807186 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=4e4d2dec-4005-0268-689c-fc689e0754b7
I0927 19:16:56.807189 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 4e4d2dec-4005-0268-689c-fc689e0754b7
I0927 19:16:56.807193 102144 plan_fragment_executor.cpp:360] cancel(): fragment_instance_id=8c4819ab-2c41-f160-ee5b-258431775a9b
I0927 19:16:56.807195 102144 fragment_mgr.cpp:550] FragmentMgr cancel worker going to cancel timeout fragment 8c4819ab-2c41-f160-ee5b-258431775

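Symptoms: the Spark job hangs and port 8040 on the BE node becomes unreachable. The be.INFO log fills with large numbers of entries like those shown above. Restarting the BE node allows writes to resume, but the same problem recurs after a while.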
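To get a sense of how many fragments are being cancelled and when, the log excerpt above can be summarized with a short script. This is a minimal sketch, not a StarRocks tool: the function name `summarize_timeout_cancels` is hypothetical, and the regular expression assumes the glog-style line layout seen in the excerpt.

```python
import re
from collections import Counter

# Matches the FragmentMgr timeout-cancel lines from the excerpt above.
# Assumed layout: level letter + MMDD, HH:MM:SS.micros, thread id, file:line]
CANCEL_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}):\d{2}\.\d+\s+\d+ fragment_mgr\.cpp:\d+\] "
    r"FragmentMgr cancel worker going to cancel timeout fragment (\S+)"
)


def summarize_timeout_cancels(lines):
    """Return (cancel count per minute, set of distinct fragment ids)."""
    per_minute = Counter()
    fragment_ids = set()
    for line in lines:
        match = CANCEL_RE.match(line)
        if match is None:
            continue  # not a timeout-cancel line
        date, minute, fragment_id = match.groups()
        per_minute[f"{date} {minute}"] += 1
        fragment_ids.add(fragment_id)
    return per_minute, fragment_ids
```

Passing an open file handle for the BE log as `lines` gives a per-minute count. A sudden jump in that count marks the time window worth comparing against Spark and network logs.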

This looks like a deadlock. The next time it happens, capture a pstack: pstack $be_pid > /tmp/pstack.log
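Because the hang only shows up after some time, the capture could be automated so that the stacks are dumped while the BE is still stuck. The sketch below is an illustration under assumptions: `http_responds`, `capture_pstack`, and `dump_if_hung` are hypothetical helper names, the URL to probe is supplied by the caller (which path on port 8040 to use is an assumption to verify for your version), and `pstack` must be installed on the BE host.

```python
import http.client
import subprocess
import urllib.request


def http_responds(url, timeout=5.0):
    """Return True if an HTTP GET to url gets a 2xx reply within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


def capture_pstack(pid, out_path):
    """Run `pstack <pid>`, write its output to out_path, return the exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            ["pstack", str(pid)], stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode


def dump_if_hung(url, be_pid, out_path="/tmp/pstack.log"):
    """Capture stacks only when the BE HTTP endpoint has stopped answering."""
    if http_responds(url):
        return False
    capture_pstack(be_pid, out_path)
    return True
```

Calling `dump_if_hung` from cron or a loop every minute or so would produce `/tmp/pstack.log` the first time the endpoint stops answering. An HTTP request is used rather than a bare TCP connect because a process whose worker threads are all blocked may still accept connections at the socket level.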

Try increasing these settings in be.conf: be_http_num_workers=96 and scanner_thread_pool_thread_num=96.

Has this problem been resolved?

Thanks a lot. I found the cause of my problem: it was a network issue, which also produced many "FragmentMgr cancel worker going to cancel timeout fragment" entries.