[Metadata] The FE service's metadata keeps growing, and replay on node restart takes a long time

Does this log indicate that checkpoints are already being executed?
2023-07-24 12:26:58,990 INFO (leaderCheckpointer|144) [Checkpoint.runAfterCatalogReady():82] checkpoint imageVersion 1662303, checkPointVersion 98641645
2023-07-24 12:26:58,990 INFO (leaderCheckpointer|144) [Checkpoint.replayAndGenerateGlobalStateMgrImage():197] begin to generate new image: image.98641645
2023-07-24 12:26:58,995 INFO (leaderCheckpointer|144) [Auth.grantRoleInternal():779] grant operator to 'root'@'%', isReplay = true
2023-07-24 12:26:58,995 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadImage():1261] start load image from /data/server/fe/meta/image/image.1662303. is ckpt: true
2023-07-24 12:26:58,996 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadHeader():1434] finished replay header from image
2023-07-24 12:26:58,997 INFO (leaderCheckpointer|144) [NodeMgr.loadLeaderInfo():1133] finished replay masterInfo from image
2023-07-24 12:26:59,016 INFO (leaderCheckpointer|144) [LocalMetastore.loadDb():331] finished replay databases from image
2023-07-24 12:26:59,017 INFO (leaderCheckpointer|144) [Load.loadLoadJob():986] finished replay loadJob from image
2023-07-24 12:26:59,017 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadAlterJob():1449] finished replay alterJob from image
2023-07-24 12:26:59,017 INFO (leaderCheckpointer|144) [CatalogRecycleBin.loadRecycleBin():1148] finished replay recycleBin from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [VariableMgr.loadGlobalVariable():559] finished replay globalVariable from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [LocalMetastore.loadCluster():4308] finished replay cluster from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [NodeMgr.loadBrokers():1101] finished replay brokerMgr from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadResources():1536] finished replay resources from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadResources():1538] start to replay resource mapping catalog
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [GlobalStateMgr.loadResources():1540] finished replaying resource mapping catalogs from resources
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [ExportMgr.loadExportJob():385] finished replay exportJob from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [BackupHandler.readFields():673] finished replay 0 backup/store jobs from image
2023-07-24 12:26:59,018 INFO (leaderCheckpointer|144) [BackupHandler.loadBackupHandler():686] finished replay backupHandler from image
2023-07-24 12:26:59,019 INFO (leaderCheckpointer|144) [Auth.loadAuth():1890] finished replay auth from image

But the bdb files still have not shrunk. The image directory currently contains:
-rw-rw-r-- 1 1000 1000 951M Apr 27 09:28 image.1662303
-rw-r--r-- 1 1000 1000 3.5M Jul 24 12:25 image.ckpt
-rw-rw-r-- 1 1000 1000 91 Jul 24 11:27 ROLE
-rw-rw-r-- 1 1000 1000 93 Apr 27 09:27 VERSION
The image file is not being updated, while the bdb files keep growing.
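A stale 3.5 MB image.ckpt next to a months-old image.1662303 is consistent with the checkpoint dying part-way through serialization. Assuming StarRocks follows the same write-to-image.ckpt-then-rename pattern as its Doris ancestry (an assumption; the class and method names below are illustrative, not the actual FE code), the flow is roughly the following. If serialization throws mid-write, the rename never happens, the old image stays current, and bdb keeps accumulating journal files, since those can only be truncated after a successful checkpoint.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    // Hypothetical sketch of the write-then-rename checkpoint pattern.
    public class ImageCheckpointSketch {
        static void saveImage(File imageDir, long journalId, String serializedState)
                throws IOException {
            File tmp = new File(imageDir, "image.ckpt");
            try (FileWriter out = new FileWriter(tmp)) {
                out.write(serializedState);      // a failure here aborts the checkpoint
            }
            File finalImage = new File(imageDir, "image." + journalId);
            if (!tmp.renameTo(finalImage)) {     // only reached on full success
                throw new IOException("rename to " + finalImage.getName() + " failed");
            }
            // Only after this point may old bdb journal files be cleaned up.
        }
    }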

Is there any more log output after this?

2023-07-24 14:26:42,385 INFO (leaderCheckpointer|144) [BackupHandler.readFields():673] finished replay 0 backup/store jobs from image
2023-07-24 14:26:42,385 INFO (leaderCheckpointer|144) [BackupHandler.loadBackupHandler():686] finished replay backupHandler from image
2023-07-24 14:26:42,386 INFO (leaderCheckpointer|144) [Auth.loadAuth():1890] finished replay auth from image
2023-07-24 14:26:42,386 INFO (leaderCheckpointer|144) [GlobalTransactionMgr.readFields():711] discard expired transaction state: TransactionState. txn_id: 13404, label: insert_83603d17-daa3-11ed-9f1b-0242ffa99e51, db id: 10002, table id list: 10530, callback id: -1, coordinator: FE: 10.9.2.217, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1681463133701, commit time: 1681463133729, finish time: 1681463133743, write cost: 28ms, publish total cost: 14ms, total cost: 42ms, reason: attachment: com.starrocks.transaction.InsertTxnCommitAttachment@45bff2d4
2023-07-24 14:26:42,386 INFO (leaderCheckpointer|144) [GlobalTransactionMgr.readFields():711] discard expired transaction state: TransactionState. txn_id: 13405, label: insert_8368a188-daa3-11ed-9f1b-0242ffa99e51, db id: 10002, table id list: 10530, callback id: -1, coordinator: FE: 10.9.2.217, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1681463133757, commit time: 1681463133777, finish time: 1681463133791, write cost: 20ms, publish total cost: 14ms, total cost: 34ms, reason: attachment: com.starrocks.transaction.InsertTxnCommitAttachment@1a2d201a
2023-07-24 14:26:42,386 INFO (leaderCheckpointer|144) [GlobalTransactionMgr.readFields():711] discard expired transaction state: TransactionState. txn_id: 13406, label: insert_836ff489-daa3-11ed-9f1b-0242ffa99e51, db id: 10002, table id list: 10530, callback id: -1, coordinator: FE: 10.9.2.217, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1681463133804, commit time: 1681463133825, finish time: 1681463133839, write cost: 21ms, publish total cost: 14ms, total cost: 35ms, reason: attachment: com.starrocks.transaction.InsertTxnCommitAttachment@405aa875

It just keeps printing these "discard expired transaction state" messages.

Nodes like 10.9.2.217 were already taken offline and dropped. There is currently only one FE node.

Could you upload the fe.warn log file?

fe.warn.log.tar.gz (11.9 MB)
We have already tuned the JVM heap up to 90 GB. Startup parameters:
JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx90g -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$STARROCKS_HOME/log/fe.gc.log.$DATE"
But right after startup, JVM memory usage is already over 50%, even though the cluster is not actually receiving any heavy requests.
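One caveat when reading that 50%: under CMS, "used" heap right after startup still includes garbage left over from image replay that has not been collected yet, so 50% of 90 GB does not necessarily mean 45 GB of live metadata. A generic way to read the numbers (plain JMX, not StarRocks-specific; jstat -gc <pid> reports the equivalent externally):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    // Prints heap usage of the current JVM; used/committed/max are the same
    // numbers external tools like jstat and jmap report for the FE process.
    public class HeapCheck {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used=%.1f GB, committed=%.1f GB, max=%.1f GB%n",
                    heap.getUsed() / 1e9, heap.getCommitted() / 1e9, heap.getMax() / 1e9);
        }
    }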

Could you also send fe.log?

No need to send it; we found the cause.

fe.log.tar.gz (13.3 MB)
The JVM heap usage is currently at around 50%: [screenshot]

2023-07-25 00:13:29,592 ERROR (leaderCheckpointer|144) [Daemon.run():117] daemon thread got exception. name: leaderCheckpointer
java.lang.OutOfMemoryError: null
at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_291]
at java.lang.StringBuffer.append(StringBuffer.java:270) ~[?:1.8.0_291]
at java.io.StringWriter.write(StringWriter.java:112) ~[?:1.8.0_291]
at com.google.gson.stream.JsonWriter.string(JsonWriter.java:584) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.stream.JsonWriter.value(JsonWriter.java:418) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.TypeAdapters$15.write(TypeAdapters.java:384) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.TypeAdapters$15.write(TypeAdapters.java:368) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:503) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:503) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:503) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:503) ~[starrocks-fe.jar:?]
at com.google.gson.Gson.toJson(Gson.java:735) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:714) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:669) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:649) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.scheduler.TaskManager.saveTasks(TaskManager.java:453) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.saveImage(GlobalStateMgr.java:1617) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.saveImage(GlobalStateMgr.java:1568) ~[starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:204) ~[starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:93) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) ~[starrocks-fe.jar:?]
at com.starrocks.common.util.Daemon.run(Daemon.java:115) [starrocks-fe.jar:?]
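The null message on that OutOfMemoryError is telling: on JDK 8 it comes from AbstractStringBuilder.hugeCapacity, which throws a message-less OutOfMemoryError when the requested capacity overflows an int, i.e. when a single StringBuilder tries to grow past Integer.MAX_VALUE characters. Paraphrasing the JDK 8 logic (not a verbatim copy):

    // Roughly what JDK 8's AbstractStringBuilder does at the lines in the trace.
    final class HugeCapacityParaphrase {
        private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

        static int hugeCapacity(int minCapacity) {
            if (Integer.MAX_VALUE - minCapacity < 0) { // minCapacity overflowed int
                throw new OutOfMemoryError();          // no message, hence "null"
            }
            return (minCapacity > MAX_ARRAY_SIZE) ? minCapacity : MAX_ARRAY_SIZE;
        }
    }

So the checkpoint fails because TaskManager.saveTasks is serializing all task metadata into a single Gson string approaching 2^31 characters; no -Xmx value can lift that per-String limit.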

You are on version 2.5.8 now, right?

[quote="gengjun, post:17, topic:7976"]
OutOfMemoryError
[/quote]
It is version 2.5.4.

Try upgrading to 2.5.8; this issue has been fixed there.

After the upgrade, fe.out keeps reporting this error: 【2.5.8】升级后FE服务内存溢出 - StarRocks 用户问答 / 日常运维 - StarRocks中文社区论坛 (mirrorship.cn)
That link is from another one of our clusters, where the problem appeared after upgrading. Now that we have also upgraded the cluster on the 128 GB host, the same problem has shown up there.
fe.out.tar.gz (110.6 KB)

After the upgrade, this error appeared:
2023-07-25 13:31:10,150 ERROR (leaderCheckpointer|112) [Daemon.run():117] daemon thread got exception. name: leaderCheckpointer
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_291]
at java.lang.StringBuffer.append(StringBuffer.java:270) ~[?:1.8.0_291]
at java.io.StringWriter.write(StringWriter.java:112) ~[?:1.8.0_291]
at com.google.gson.stream.JsonWriter.string(JsonWriter.java:590) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.stream.JsonWriter.writeDeferredName(JsonWriter.java:401) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.stream.JsonWriter.value(JsonWriter.java:416) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.TypeAdapters$15.write(TypeAdapters.java:384) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.TypeAdapters$15.write(TypeAdapters.java:368) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:509) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:509) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:509) ~[starrocks-fe.jar:?]
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.persist.gson.GsonUtils$ProcessHookTypeAdapterFactory$1.write(GsonUtils.java:509) ~[starrocks-fe.jar:?]
at com.google.gson.Gson.toJson(Gson.java:735) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:714) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:669) ~[spark-dpp-1.0.0.jar:?]
at com.google.gson.Gson.toJson(Gson.java:649) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.scheduler.TaskManager.saveTasks(TaskManager.java:533) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.saveImage(GlobalStateMgr.java:1639) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.saveImage(GlobalStateMgr.java:1590) ~[starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.replayAndGenerateGlobalStateMgrImage(Checkpoint.java:204) ~[starrocks-fe.jar:?]
at com.starrocks.leader.Checkpoint.runAfterCatalogReady(Checkpoint.java:93) ~[starrocks-fe.jar:?]
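For context, "Requested array size exceeds VM limit" is the sibling symptom of the same ceiling: Arrays.copyOf asked the VM for a char[] close to Integer.MAX_VALUE, which HotSpot refuses. The trace shows Gson.toJson(Object) buffering the whole section into a StringWriter before anything reaches disk. Gson also offers a streaming overload, toJson(Object, Appendable), which writes directly to a file and never materializes the giant String; a minimal sketch (illustrative only, not what the 2.5.x FE actually does):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;
    import com.google.gson.Gson;

    // Streams JSON straight to disk instead of building one huge in-memory String.
    public class StreamingJsonWrite {
        public static void write(Object state, String path) throws IOException {
            Gson gson = new Gson();
            try (Writer out = new FileWriter(path)) {
                gson.toJson(state, out); // no intermediate StringWriter
            }
        }
    }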

This error was presumably caused by having disabled transparent huge pages with: echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
After setting it back to always, that error no longer shows up, but now this is reported instead:
2023-07-25 13:32:10,272 WARN (leaderCheckpointer|112) [ColocateTableIndex.cleanupInvalidDbOrTable():989] remove 0 invalid tableid: []
2023-07-25 13:32:18,716 WARN (leaderCheckpointer|112) [GlobalStateMgr.updateBaseTableRelatedMv():1384] Setting the materialized view mv_zt_saas_public_staff_department_data(50300) to invalid because the table null was not exist.
2023-07-25 13:32:18,716 WARN (leaderCheckpointer|112) [GlobalStateMgr.updateBaseTableRelatedMv():1384] Setting the materialized view mv_zt_saas_public_staff_department_data(50300) to invalid because the table null was not exist.
2023-07-25 13:32:21,999 WARN (leaderCheckpointer|112) [OlapTable.onDrop():2077] Ignore materialized view MvId{dbId=10616, id=158009} does not exists
2023-07-25 13:32:21,999 WARN (leaderCheckpointer|112) [OlapTable.onDrop():2077] Ignore materialized view MvId{dbId=10616, id=158112} does not exists
2023-07-25 13:32:39,565 WARN (leaderCheckpointer|112) [OlapTable.onDrop():2077] Ignore materialized view MvId{dbId=10616, id=158370} does not exists
2023-07-25 13:32:40,201 WARN (leaderCheckpointer|112) [OlapTable.onDrop():2077] Ignore materialized view MvId{dbId=10616, id=159770} does not exists
2023-07-25 13:32:40,250 WARN (leaderCheckpointer|112) [OlapTable.onDrop():2077] Ignore materialized view MvId{dbId=10616, id=160850} does not exists
Could you take a look and tell us what this problem is?

We upgraded further, to 2.5.9, and set the FE memory to 110 GB (the host has 128 GB), but we still get:
java.lang.OutOfMemoryError: null
How much memory does this actually need?

Is it still failing in saveTasks?

The error occurs when the checkpoint daemon is started.
2023-07-26 13:31:48,478 ERROR (leaderCheckpointer|112) [Daemon.run():117] daemon thread got exception. name: leaderCheckpointer
java.lang.OutOfMemoryError: null
at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125) ~[?:1.8.0_291]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_291]
at java.lang.StringBuffer.append(StringBuffer.java:270) ~[?:1.8.0_291]

Could this error be caused by there being too much data in the bdb database, so that the amount loaded exceeds the size defined by hugeCapacity, leading to the OOM?
Here is a reference we found: https://blog.51cto.com/u_15696939/5415367
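That reading matches the traces: the cap being hit is hugeCapacity's per-String limit of Integer.MAX_VALUE characters, not the heap size, which is why 90 GB and 110 GB behaved identically. A minimal reproduction (needs on the order of 10 GB of heap to reach the cap; with less you hit a plain "Java heap space" error first):

    // Run with e.g. -Xmx12g. A single StringBuilder cannot exceed
    // Integer.MAX_VALUE chars no matter how large the heap is.
    public class StringLimitDemo {
        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            char[] chunk = new char[64 << 20]; // 64M chars per append
            try {
                while (true) {
                    sb.append(chunk);
                }
            } catch (OutOfMemoryError e) {
                // On JDK 8 the message is null when thrown by hugeCapacity, or
                // "Requested array size exceeds VM limit" when the VM rejects it.
                System.out.println("OOM at length " + sb.length()
                        + ", message: " + e.getMessage());
            }
        }
    }

Which also means the fix is to shrink what saveTasks has to serialize, not to add more heap.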

Did you ever solve this problem? I also have a large backlog of files in the bdb directory after scaling down nodes.