Cluster deployed on k8s with the Operator: slow connections, table creation fails, and listing the tables in a database fails

【Shared-data (storage-compute separation)】No
【StarRocks version】3.2.6
【Cluster size】1 FE + 3 BE
【Details】Version 3.2.6

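Problem 1: After deploying the cluster, test connections are slow. The connection often hangs at 50% and only reports success after nearly four minutes.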


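Problem 2: After connecting to the StarRocks cluster, I created a database testDb. When I open the database to view its information, the client keeps loading the tables and "load testDb properties", and finally reports this error: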

SQL Error [1064] [42000]: The brpc stub of starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local:8060 is null. backend [id=10005] [host=starrockscluster-sample-be-0.starrockscluster-sample-be-search.starrocks.svc.cluster.local]

Problem 3: Creating a table manually from the client fails with an error after several minutes:

InvocationTargetException

java.lang.reflect.InvocationTargetException

SQL Error [1064] [42000]: Couldn't open transport for starrockscluster-sample-fe-0.starrockscluster-sample-fe-search.starrocks.svc.cluster.local:9020 (Could not resolve host for client socket.), host: unknown


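Problem 4: Creating a table with SQL fails with a creation timeout. Following the error message, I increased tablet_create_timeout_second, but the same error still occurs: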

Unexpected exception: Table creation timed out.

You can increase the timeout by increasing the config “tablet_create_timeout_second” and try again.

To increase the config “tablet_create_timeout_second” (currently 10), run the following command:

The deployment YAML is attached:

starrocks-fe-and-be.yaml (3.5 KB)

Check with kubectl whether all the pods are healthy.

Please also post the FE and BE logs.

FE log: fe-log.7z (181.6 KB)
BE0 log: be0-log.7z (324.0 KB)
BE1 log: be1-log.7z (320.9 KB)
BE2 log: be2-log.7z (332.1 KB)

2024-05-24 00:01:43.104Z WARN (thrift-server-pool-7605|8385) [SystemInfoService.getComputeNodeWithBePortCommon():623] failed to get right ip by fqdn starrockscluster-sample-be-0.starrockscluster-sample-be-search.starrocks.svc.cluster.local: Temporary failure in name resolution
2024-05-24 00:01:43.104Z WARN (thrift-server-pool-7606|8386) [SystemInfoService.getComputeNodeWithBePortCommon():623] failed to get right ip by fqdn starrockscluster-sample-be-0.starrockscluster-sample-be-search.starrocks.svc.cluster.local
2024-05-24 00:02:15.172Z WARN (thrift-server-pool-7608|8390) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local: Temporary failure in name resolution
2024-05-24 00:02:15.172Z WARN (thrift-server-pool-7609|8391) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local
2024-05-24 00:02:15.178Z WARN (thrift-server-pool-7612|8394) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local
2024-05-24 00:02:32.779Z WARN (thrift-server-pool-7611|8393) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-0.starrockscluster-sample-be-search.starrocks.svc.cluster.local: Temporary failure in name resolution
2024-05-24 00:02:55.221Z WARN (thrift-server-pool-7617|8399) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local
2024-05-24 00:02:55.221Z WARN (thrift-server-pool-7618|8400) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local
2024-05-24 00:02:55.221Z WARN (thrift-server-pool-7613|8395) [SystemInfoService.getComputeNodeWithBePortCommon():614] failed to get right ip by fqdn starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local: Temporary failure in name resolution
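To see which FQDNs the FE fails to resolve, how often, and since when, the WARN lines above can be tallied per host after extracting fe-log.7z. This is a minimal Python sketch: the regular expression is written only against the lines quoted here, and the default file name fe.warn.log is an assumption, so pass the path of the extracted FE log explicitly if it differs.

```python
import re
import sys
from collections import Counter

# Pattern written against the FE WARN lines quoted in this thread; other
# StarRocks versions may format these messages differently.
FQDN_FAILURE = re.compile(
    r"^(?P<ts>\S+ \S+) WARN .*failed to get right ip by fqdn (?P<fqdn>[\w.-]+)"
)


def tally_fqdn_failures(lines):
    """Count 'failed to get right ip by fqdn' warnings per FQDN.

    Returns (Counter of fqdn -> count, dict of fqdn -> first timestamp seen).
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = FQDN_FAILURE.search(line)
        if match is None:
            continue  # not a name-resolution warning
        fqdn = match.group("fqdn")
        counts[fqdn] += 1
        first_seen.setdefault(fqdn, match.group("ts"))
    return counts, first_seen


def main(path):
    with open(path, encoding="utf-8", errors="replace") as log:
        counts, first_seen = tally_fqdn_failures(log)
    for fqdn, n in counts.most_common():
        print(f"{n:6d}  first={first_seen[fqdn]}  {fqdn}")


if __name__ == "__main__":
    # Assumed default file name; pass the real log path as the first argument.
    main(sys.argv[1] if len(sys.argv) > 1 else "fe.warn.log")
```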

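The DNS of the cluster where StarRocks is deployed may be at fault: the FE repeatedly fails to resolve the BE pods' FQDNs ("Temporary failure in name resolution"). Please investigate it carefully.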

So I should check the k8s DNS? Could you explain in more detail how to troubleshoot it?
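One way to start is to repeat a system-resolver lookup, similar to the one failing in the FE log, from inside the cluster and time each attempt: the log above shows repeated failures over several minutes, and slow lookups would also be consistent with the four-minute connection delay in Problem 1. This is a minimal Python sketch under several assumptions: the host list is copied from the errors in this thread, the 5-second timeout and 5 attempts are arbitrary choices, and it must run in a pod inside the cluster that has Python 3 available (the StarRocks images may not include it, so a separate debug pod in the starrocks namespace may be needed), because *.svc.cluster.local names only resolve through the cluster DNS.

```python
import socket
import sys
import threading
import time

# FQDNs copied from the errors in this thread; replace them with the names of
# your own FE/BE pods (cluster name and namespace will differ).
HOSTS = [
    "starrockscluster-sample-fe-0.starrockscluster-sample-fe-search.starrocks.svc.cluster.local",
    "starrockscluster-sample-be-0.starrockscluster-sample-be-search.starrocks.svc.cluster.local",
    "starrockscluster-sample-be-1.starrockscluster-sample-be-search.starrocks.svc.cluster.local",
    "starrockscluster-sample-be-2.starrockscluster-sample-be-search.starrocks.svc.cluster.local",
]


def resolve_host(host, timeout=5.0, resolver=socket.getaddrinfo):
    """Resolve `host` once; return (ok, ips_or_error_text, seconds_taken).

    getaddrinfo() has no timeout argument, so the lookup runs in a daemon
    thread that is abandoned (not cancelled) if it exceeds `timeout`.
    """
    result = {}

    def worker():
        try:
            result["infos"] = resolver(host, None)
        except (OSError, UnicodeError) as exc:  # gaierror subclasses OSError
            result["error"] = exc

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    elapsed = time.monotonic() - start
    if thread.is_alive():
        return False, f"timed out after {timeout}s", elapsed
    if "error" in result:
        return False, str(result["error"]), elapsed
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    ips = sorted({info[4][0] for info in result["infos"]})
    return True, ips, elapsed


def main(hosts, attempts=5):
    """Look up each host several times; return the number of failed lookups."""
    failures = 0
    for host in hosts:
        for attempt in range(1, attempts + 1):
            ok, detail, elapsed = resolve_host(host)
            status = "OK  " if ok else "FAIL"
            print(f"{status} {host} #{attempt} {elapsed:.2f}s {detail}")
            if not ok:
                failures += 1
    return failures


if __name__ == "__main__":
    sys.exit(1 if main(HOSTS) else 0)
```

If some attempts print FAIL with "Temporary failure in name resolution", or succeed only after several seconds, that would point at the cluster DNS service (commonly CoreDNS) or the pod's resolver configuration rather than at StarRocks itself.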

Solved, thanks.

I ran into the same problem. How did you solve the DNS issue?