StarRocks queries against a Hive catalog occasionally fail with "File does not exist"

【StarRocks version】2.5.22
【Cluster size】3 FEs + 3 BEs (FE and BE co-located)
【Problem description】We use StarRocks to accelerate application queries against a Hive catalog, but the queries occasionally fail with "File does not exist".


【Error log】See the attachments for the full logs.

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hive/warehouse/ads_bdc_services.db/ads_services_week_org_type_sal/partition_day=2024-12-15/part-00026-0c8fc350-c9c7-4d2e-85aa-1f9774eed2f5.c000.snappy.parquet
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
        at org.apache.hadoop.ipc.Client.call(Client.java:1513)
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
        at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:334)
        at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)

be.out (18.1 MB)

Could anyone help take a look?

This looks like an overwrite-style operation on the HDFS side, combined with the metadata cache being enabled on the StarRocks side. In that case, retrying the query a second time usually succeeds.
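
If stale cached metadata is the cause, manually refreshing the table (or just the affected partition) should clear it without waiting for the background refresh. A minimal sketch, assuming 2.5's REFRESH EXTERNAL TABLE syntax for external catalog tables, with the catalog, database, table, and partition names taken from the error message and logs in this thread:

-- Re-fetch the cached partition and file metadata for this table from the metastore/HDFS.
-- All names below come from the error message and FE log quoted in this thread.
REFRESH EXTERNAL TABLE dsm_hive_catalog.ads_bdc_services.ads_services_week_org_type_sal
PARTITION ('partition_day=2024-12-15');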

Nothing was run on HDFS; all of our jobs run in the early morning. The current situation is that business queries against the external table through the StarRocks Hive catalog occasionally fail, while querying with Spark works fine.


A few things to check:

1. Does the file in the error message actually no longer exist on HDFS?

2. Does querying the table through Hive itself report an error? StarRocks reads the table's metadata from Hive.

3. Is the metadata cache enabled? (One way to check is shown in the sketch after this list.)

4. You can also search the FE log for "background refresh hive metadata".
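
For check 3, the flag can be read from the FE config; a minimal sketch, where enable_background_refresh_connector_metadata is confirmed later in this thread, and background_refresh_metadata_interval_millis is assumed to be the 2.5 FE config that controls how often the background refresh runs:

-- Is background metadata refresh for external catalogs enabled?
ADMIN SHOW FRONTEND CONFIG LIKE '%enable_background_refresh_connector_metadata%';
-- How often does it run (in milliseconds)? Config name assumed from 2.5 FE defaults.
ADMIN SHOW FRONTEND CONFIG LIKE '%background_refresh_metadata_interval_millis%';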

1. Yes, the parquet file from the error message does not exist on HDFS.

2. The table can be queried in Hive; querying it with beeline returns no error:

select max(etl_time) from ads_bdc_services.ads_services_week_org_type_sal  where partition_day='2024-12-15';
+------------------------+
|          _c0           |
+------------------------+
| 2024-12-18 18:43:29.0  |
+------------------------+

3. The metadata cache is enabled:

mysql>  ADMIN SHOW FRONTEND CONFIG like '%enable_background_refresh_connector_metadata%';
+----------------------------------------------+------------+-------+---------+-----------+---------+
| Key                                          | AliasNames | Value | Type    | IsMutable | Comment |
+----------------------------------------------+------------+-------+---------+-----------+---------+
| enable_background_refresh_connector_metadata | []         | true  | boolean | true      |         |
+----------------------------------------------+------------+-------+---------+-----------+---------+

4. Regarding the refresh activity, fe.log only contains entries like the excerpt quoted further below:

I've uploaded the fe.log file, could you help analyze it? :pray: Thanks. Let me know what other information I can provide. fe.log.zip (32.6 MB)

Any findings on this issue?

2024-12-19 14:43:36,659 INFO (com.starrocks.connector.hive.ConnectorTableMetadataProcessor|44) [CacheUpdateProcessor.refreshTableBackground():102] dsm_hive_catalog.ads_bdc_services.ads_services_week_org_type_sal
partitions has updated, updated partition size is 260, refresh partition and file success
2024-12-19 14:43:36,669 INFO (com.starrocks.connector.hive.ConnectorTableMetadataProcessor|44) [ConnectorTableMetadataProcessor.refreshCatalogTable():113] refresh table dsm_hive_catalog.ads_bdc_services.ads_services_week_org_type_sal success

It looks like the partitions were refreshed at that point in time. Did the query error occur before or after this refresh?

The error was already happening before that refresh; we only discovered the problem when the query failed, and then ran it manually once.

Is there a solution for this?