【BE Crash】BE crashes and exits unexpectedly

【Details】The BE crashes intermittently.
【Background】When the BE queries Hive external tables, be.out repeatedly logs errors that an HDFS file cannot be found, after which the BE crashes and exits.
【Business impact】Temporarily worked around by restarting the BE.
【StarRocks version】2.3.4
【Cluster size】3 FE (3 followers) + 12 BE
【Machine specs】64C/256G
【Contact】linenwei1@jd.com
【Attachments】

  • be crash
  • be.out

@trueeyu could you take a look when you get a chance? Is there a related PR that fixes this? Thanks!

Is the be.out you posted incomplete?

hdfsOpenFile(hdfs://xxx/xxxx): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
RemoteException: File does not exist: /xxx/xxx/xxx
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:155)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2150)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:795)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:554)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1105)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1069)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:996)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2010)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3068)
java.io.FileNotFoundException: File does not exist: /xxx/xxx
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:155)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2150)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:795)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:554)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1105)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1069)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:996)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2010)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3068)

at sun.reflect.GeneratedConstructorAccessor2.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:894)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:881)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:870)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1038)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:333)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:346)

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /xxx/xxx
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:155)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2150)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:795)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:554)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1105)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1069)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:996)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2010)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3068)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1562)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
at com.sun.proxy.$Proxy9.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:333)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:892)
... 7 more

query_id:fb9b0c25-b29f-11ed-9ed1-fa163e6c0ff8, fragment_instance:fb9b0c25-b29f-11ed-9ed1-fa163e6c0ff9
*** Aborted at 1677063571 (unix time) try "date -d @1677063571" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 29374 (TID 0x7f3bfb4d0700) from PID 0; stack trace: ***
@ 0x3f1b042 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f3c9573c1a2 os::Linux::chained_handler()
@ 0x7f3c95742826 JVM_handle_linux_signal
@ 0x7f3c95738e13 signalHandler()
@ 0x7f3c94c0a630 (unknown)
@ 0x0 (unknown)
start time: Wed Feb 22 19:00:04 CST 2023

be.out contains multiple hdfsOpenFile errors about HDFS paths not being found; after the BE prints the log above several times, it crashes. Per company policy, I have replaced the HDFS paths in the log with xxx.
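
To gauge how often this happens before a crash, a quick sketch; the be.out location is an assumption and depends on your deployment:

    # Count hdfsOpenFile error lines in the BE stdout log (path is hypothetical).
    grep -c 'hdfsOpenFile' /path/to/be/log/be.out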

Look up this SQL in fe.audit.log: fb9b0c25-b29f-11ed-9ed1-fa163e6c0ff8
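
A minimal sketch of how to find it on the FE node, assuming a typical log location (adjust the path to your deployment; the audit log may also have rolled into fe.audit.log.* files):

    # Search the FE audit logs for the query ID (path is hypothetical).
    grep 'fb9b0c25-b29f-11ed-9ed1-fa163e6c0ff8' /path/to/fe/log/fe.audit.log*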

The HDFS path-not-found errors are, as I understand it, actually expected: under concurrent read/write, HDFS files get renamed, and if the FE-side metadata has not been refreshed at that moment, the query reports a path-not-found error; once the metadata is refreshed later, the file can be found again (a refresh sketch follows below). As for the SQL fb9b0c25-b29f-11ed-9ed1-fa163e6c0ff8, would you mind adding me on WeChat? I'll send you the SQL there. WeChat: 17610907960
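
If stale FE-side metadata is indeed what is happening, manually refreshing the cached Hive metadata for the table should make the renamed files visible again. A minimal sketch, assuming the REFRESH EXTERNAL TABLE statement available in StarRocks 2.3 and hypothetical host/database/table names:

    # Refresh the FE's cached Hive metadata for one external table (names are hypothetical).
    mysql -h <fe_host> -P 9030 -u root -e "REFRESH EXTERNAL TABLE hive_db.hive_ext_table;"

This only refreshes that one table's metadata cache on the FE; the periodic background refresh will eventually do the same.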

Hold on a moment, something came up; I'll add you in a bit.

Sure, sounds good, no rush.

Sorry — after reproducing the SQL, it turned out to be caused by our own modification to SR's aggregate pushdown, which triggered a corner case in a particular SQL. Thanks @trueeyu for your help!