FE nodes frequently hit OOM

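【Details】During writes, thrift-server-pool climbs to its maximum, after which the FE repeatedly runs out of memory. The write workload consists of Flink jobs and INSERT OVERWRITE ... SELECT * FROM a Hive table; the latter is the more likely cause of the OOM. With the StarRocks JVM -Xmx set to 32 GB the FE hits OOM, and with 64 GB it still hits OOM. From my own analysis, this is most likely a memory leak.
【Background】Tried versions 3.1.11, 3.2.6, and 3.2.7; the problem occurs on all of them.
【Business impact】
【Shared-data (storage-compute separation)】No
【StarRocks version】3.1.11, 3.2.6, 3.2.7
【Cluster size】3 FE + 80 BE
【Machine info】40 cores, 192 GB RAM, 12 × 1 TB SSD
【Contact】Forum
【Attachments】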

The heap dump is larger than 32 GB, so the attachments below are the results of running ParseHeapDump.sh heapdump.hprof org.eclipse.mat.api:suspects org.eclipse.mat.api:overview org.eclipse.mat.api:top_components
java_pid40682_Leak_Suspects.zip (224.5 KB)
java_pid40682_System_Overview.zip (135.7 KB)
java_pid40682_Top_Components.zip (330.5 KB)

Screenshots from the MAT analysis:




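Drilling into this large object shows that every thrift-server-pool thread holds its own identical copy of the OlapTable data, which should be reused instead.
Rough contents of the object: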
{
"comment": "",
"indexIdToMeta": {
"12219": {
"schemaHash": 1730590400,
"storageType": "COLUMN",
"keysType": "DUP_KEYS",
"schemaId": 12219,
"indexId": 12219,
"isColocateMVIndex": false,
"shortKeyColumnCount": 3,
"dbId": 0,
"schemaVersion": 0,
"schema": [
{
"comment": "",
"isAutoIncrement": false,
"stats": {
"maxSize": -1,
"numDistinctValues": -1,
"avgSerializedSize": -1.0,
"numNulls": -1
...
"lastFailedVersion": -1,
"lastSuccessVersion": 2,
"backendId": 10011,
"rowCount": 211135,
"state": "DECOMMISSION",
"version": 2,
"dataSize": 20133914,
"id": 2721133,
"minReadableVersion": 0
},
{
"lastFailedVersion": -1,
"lastSuccessVersion": 2,
"backendId": 10332,
"rowCount": 211135,
"state": "NORMAL",
"version": 2,
"dataSize": 20136612,
"id": 2730970,
"minReadableVersion": 1
}
],
"clazz": "LocalTablet",
"signature": -1,
"id": 2721130,
"checkedVersion": 2
},
There are hundreds of thousands of copies of data like the above, which consumes a large amount of memory.
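
To illustrate why per-thread copies add up, here is a minimal sketch, not StarRocks code. `ToyReplica`, `ToyTable`, and `PerThreadCopyDemo` are hypothetical names, and the table size and worker count are made-up values. It compares how many distinct replica objects stay reachable when each pool worker deep-copies the table with the case where all workers share one snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one replica entry in the dump; not a StarRocks class.
final class ToyReplica {
    final long backendId;
    final long rowCount;

    ToyReplica(long backendId, long rowCount) {
        this.backendId = backendId;
        this.rowCount = rowCount;
    }
}

// Hypothetical stand-in for a table that owns its replica metadata.
final class ToyTable {
    final String name;
    final List<ToyReplica> replicas;

    ToyTable(String name, List<ToyReplica> replicas) {
        this.name = name;
        this.replicas = replicas;
    }

    // Allocates a fresh object for every replica, as a full deep copy would.
    ToyTable deepCopy() {
        List<ToyReplica> copy = new ArrayList<>(replicas.size());
        for (ToyReplica r : replicas) {
            copy.add(new ToyReplica(r.backendId, r.rowCount));
        }
        return new ToyTable(name, copy);
    }
}

class PerThreadCopyDemo {
    static ToyTable buildTable(int replicaCount) {
        List<ToyReplica> replicas = new ArrayList<>(replicaCount);
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new ToyReplica(10_000L + i, 211_135L));
        }
        return new ToyTable("t", replicas);
    }

    // Each worker returns the table object it would hold while serving a request.
    static List<ToyTable> runWorkers(ToyTable source, int workers, boolean copyPerWorker) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<ToyTable>> futures = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                futures.add(pool.submit(() -> copyPerWorker ? source.deepCopy() : source));
            }
            List<ToyTable> held = new ArrayList<>();
            for (Future<ToyTable> f : futures) {
                held.add(f.get());
            }
            return held;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Counts replica objects by identity, so shared objects are counted once.
    static int distinctReplicas(List<ToyTable> held) {
        Set<ToyReplica> seen =
                Collections.newSetFromMap(new IdentityHashMap<ToyReplica, Boolean>());
        for (ToyTable t : held) {
            seen.addAll(t.replicas);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        ToyTable table = buildTable(1_000);
        System.out.println("copy per worker: " + distinctReplicas(runWorkers(table, 8, true)));
        System.out.println("shared snapshot: " + distinctReplicas(runWorkers(table, 8, false)));
    }
}
```

With 8 workers and 1,000 replicas, the per-worker copies keep 8,000 replica objects alive, while the shared snapshot keeps 1,000. The retained count grows with workers × replicas, which matches the pattern in the dump.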

Thanks for the feedback. We'll look into how to optimize this.

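Could you add me on WeChat? If the heap dump is large, try compressing it first; heap dumps compress well. We have machines with plenty of memory on our side and can analyze it.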

Sure, I've added you on WeChat.

How was this resolved? We have run into the same problem.

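Sorry, I only just saw this. Which version are you on? Versions from 3.3.x onward fix part of this, but not completely. The root cause is that every thread deep-copies all of a table's tablet information, even though INSERT OVERWRITE does not need tablet information at all. When a table has many tablets, these copies take up a lot of memory. Our change skips copying the unnecessary information:
In addPartitions and getCopiedTable, replace the call to selectiveCopy with selectiveCopyWithoutTablets.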

    // Avoid a deep copy here because it is very expensive.
    public void copyWithoutTablets(OlapTable olapTable) {
        olapTable.id = this.id;
        olapTable.name = this.name;
        olapTable.fullSchema = Lists.newArrayList(this.fullSchema);
        olapTable.nameToColumn = Maps.newHashMap(this.nameToColumn);
        olapTable.state = this.state;
        olapTable.indexNameToId = Maps.newHashMap(this.indexNameToId);
        olapTable.indexIdToMeta = Maps.newHashMap(this.indexIdToMeta);
        olapTable.keysType = this.keysType;
        if (this.relatedMaterializedViews != null) {
            olapTable.relatedMaterializedViews = Sets.newHashSet(this.relatedMaterializedViews);
        }
        if (this.uniqueConstraints != null) {
            olapTable.uniqueConstraints = Lists.newArrayList(this.uniqueConstraints);
        }
        if (this.foreignKeyConstraints != null) {
            olapTable.foreignKeyConstraints = Lists.newArrayList(this.foreignKeyConstraints);
        }
        if (this.partitionInfo != null) {
            olapTable.partitionInfo = DeepCopy.copyWithGson(this.partitionInfo, PartitionInfo.class);
        }
        olapTable.defaultDistributionInfo = this.defaultDistributionInfo;
        Map<Long, Partition> idToPartitions = new HashMap<>();
        Map<String, Partition> nameToPartitions = Maps.newTreeMap(String.CASE_INSENSITIVE_ORDER);
        olapTable.idToPartition = idToPartitions;
        olapTable.nameToPartition = nameToPartitions;
        olapTable.tempPartitions = new TempPartitions();
        for (Partition tempPartition : this.getTempPartitions()) {
            olapTable.tempPartitions.addPartition(tempPartition.shallowCopy());
        }
        olapTable.baseIndexId = this.baseIndexId;
        if (this.tableProperty != null) {
            olapTable.tableProperty = this.tableProperty.copy();
        }

        // Shallow copy shared data to check whether the copied table has changed or not.
        olapTable.lastSchemaUpdateTime = this.lastSchemaUpdateTime;
        olapTable.lastVersionUpdateStartTime = this.lastVersionUpdateStartTime;
        olapTable.lastVersionUpdateEndTime = this.lastVersionUpdateEndTime;
    }
    public OlapTable selectiveCopyWithoutTablets(Collection<String> reservedPartitions,
                                                 boolean resetState, IndexExtState extState) {
        // Build the copy field by field instead of calling
        // DeepCopy.copyWithGson(this, OlapTable.class), which also copies every tablet.
        OlapTable copied = new OlapTable();
        this.copyWithoutTablets(copied);
        return selectiveCopyInternal(copied, reservedPartitions, resetState, extState);
    }
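
For anyone who wants to see the idea outside the StarRocks code base, here is a minimal, self-contained sketch. It uses a simplified, hypothetical model: `SketchTable`, `SketchPartition`, `SketchTablet`, and `CopyWithoutTabletsDemo` are illustration names, not StarRocks classes. It only shows the general technique of copying table and partition metadata while leaving the tablet lists empty; it does not reproduce what `selectiveCopyInternal` does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of table metadata; not the StarRocks classes.
final class SketchTablet {
    final long id;

    SketchTablet(long id) {
        this.id = id;
    }
}

final class SketchPartition {
    final String name;
    final List<SketchTablet> tablets = new ArrayList<>();

    SketchPartition(String name) {
        this.name = name;
    }

    // Keeps the partition identity; clones the tablets only when asked to.
    SketchPartition copy(boolean withTablets) {
        SketchPartition p = new SketchPartition(name);
        if (withTablets) {
            for (SketchTablet t : tablets) {
                p.tablets.add(new SketchTablet(t.id));
            }
        }
        return p;
    }
}

final class SketchTable {
    final String name;
    final List<String> columns;
    final Map<String, SketchPartition> partitions = new LinkedHashMap<>();

    SketchTable(String name, List<String> columns) {
        this.name = name;
        this.columns = columns;
    }

    // Copies schema and partition names; tablet lists stay empty when withTablets is false.
    SketchTable copy(boolean withTablets) {
        SketchTable t = new SketchTable(name, new ArrayList<>(columns));
        for (SketchPartition p : partitions.values()) {
            t.partitions.put(p.name, p.copy(withTablets));
        }
        return t;
    }

    int tabletCount() {
        int n = 0;
        for (SketchPartition p : partitions.values()) {
            n += p.tablets.size();
        }
        return n;
    }
}

class CopyWithoutTabletsDemo {
    static SketchTable build(int partitionCount, int tabletsPerPartition) {
        SketchTable t = new SketchTable("t", Arrays.asList("k1", "k2", "v1"));
        long nextId = 1;
        for (int p = 0; p < partitionCount; p++) {
            SketchPartition part = new SketchPartition("p" + p);
            for (int i = 0; i < tabletsPerPartition; i++) {
                part.tablets.add(new SketchTablet(nextId++));
            }
            t.partitions.put(part.name, part);
        }
        return t;
    }

    public static void main(String[] args) {
        SketchTable table = build(4, 100);
        System.out.println("tablets in full copy:  " + table.copy(true).tabletCount());
        System.out.println("tablets in light copy: " + table.copy(false).tabletCount());
    }
}
```

The light copy still carries the column list and partition names, so a caller that only needs schema and partition metadata can work with it, while its cost no longer scales with the number of tablets.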