Merging Feed-Forward Sublayers for Compressed Transformers

Neha Verma; Kenton Murray; Kevin Duh

arXiv:2501.06126·cs.CL·April 1, 2025

TL;DR

This paper introduces a novel method for compressing Transformer models by merging similar feed-forward sublayers, reducing parameters significantly while maintaining high performance across various tasks.

Contribution

The paper proposes a new approach to model compression by merging feed-forward sublayers, achieving high compression rates with minimal performance loss.

Findings

01

Merged over a third of feed-forward sublayers in models.

02

Maintained 99% of original performance after 21% parameter reduction.

03

Outperformed layer-pruning baseline in experiments.
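
To make the link between merged sublayers and parameter reduction concrete, the following is a rough back-of-the-envelope sketch. The model dimensions (12 layers, `d_model=768`, `d_ff=3072`) and the group size of 5 are hypothetical and not taken from the paper; embeddings and layer norms are ignored, and the helper names are illustrative only.

```python
def ffn_params(d_model, d_ff):
    """Weights and biases of one two-matrix feed-forward sublayer."""
    return 2 * d_model * d_ff + d_ff + d_model


def attn_params(d_model):
    """Q, K, V, and output projections with biases (assumed layout)."""
    return 4 * (d_model * d_model + d_model)


def fraction_removed(n_layers, d_model, d_ff, n_merged, other_params=0):
    """Fraction of parameters saved when n_merged FFN sublayers share one set of weights."""
    per_layer = ffn_params(d_model, d_ff) + attn_params(d_model)
    total = n_layers * per_layer + other_params
    # Sharing one weight set across n_merged sublayers drops n_merged - 1 copies.
    removed = (n_merged - 1) * ffn_params(d_model, d_ff)
    return removed / total


# Hypothetical configuration: merge 5 of 12 FFN sublayers into one shared sublayer.
frac = fraction_removed(n_layers=12, d_model=768, d_ff=3072, n_merged=5)
print(f"fraction of parameters removed: {frac:.3f}")
```

Only the stored copies shrink in this accounting: every layer still runs a feed-forward sublayer, so the amount of computation per forward pass is unchanged.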

Abstract

With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of…
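
The align-and-merge step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a ReLU feed-forward sublayer, uses a weight-based similarity between hidden units (the abstract does not say which similarity is used), and brute-forces the permutation, which only works for tiny hidden sizes. All function names are hypothetical.

```python
import itertools
import numpy as np


def ffn(x, w_in, b_in, w_out, b_out):
    """Feed-forward sublayer: ReLU(x @ w_in + b_in) @ w_out + b_out."""
    return np.maximum(x @ w_in + b_in, 0.0) @ w_out + b_out


def permute_hidden(params, perm):
    """Reorder hidden units; the function computed by the sublayer is unchanged."""
    w_in, b_in, w_out, b_out = params
    return w_in[:, perm], b_in[perm], w_out[perm, :], b_out


def unit_features(params):
    """One row per hidden unit: incoming weights, bias, outgoing weights."""
    w_in, b_in, w_out, _ = params
    return np.concatenate([w_in.T, b_in[:, None], w_out], axis=1)


def align(reference, other):
    """Permutation of `other`'s hidden units that best matches `reference`.

    Assumption: similarity is the dot product of per-unit weight features.
    Brute force over all permutations, so only usable for toy sizes.
    """
    sim = unit_features(reference) @ unit_features(other).T
    rows = np.arange(sim.shape[0])
    best = max(itertools.permutations(range(sim.shape[0])),
               key=lambda p: sim[rows, list(p)].sum())
    return np.array(best)


def merge(reference, other):
    """Align `other` to `reference`, then average the two parameter sets."""
    aligned = permute_hidden(other, align(reference, other))
    return tuple((a + b) / 2.0 for a, b in zip(reference, aligned))


rng = np.random.default_rng(0)
d_model, d_ff = 3, 5
layer_a = (rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff),
           rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model))
# Toy case: layer_b is layer_a with its hidden units shuffled.
layer_b = permute_hidden(layer_a, rng.permutation(d_ff))
x = rng.normal(size=(4, d_model))
merged = merge(layer_a, layer_b)
```

In this toy case the second sublayer is a shuffled copy of the first, so alignment recovers the shuffle and the merged sublayer matches the original. Real sublayers are not permuted copies of one another, so averaging is lossy even after alignment. For realistic hidden sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.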

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The paper is well-written with a clear and logical structure. The paper presents a novel method to reduce the storage costs of deep Transformer-based models by merging their parameters. The experimental results provide a detailed discussion of parameter merging across multiple deep models on various tasks, demonstrating the effectiveness of the proposed approach.

Weaknesses

As shown in Table 1, parameter merging maintains the model's inference speed but still requires fine-tuning, which highlights the drawbacks of this approach. Although distinct from parameter pruning, parameter merging/sharing remains a common model compression technique. However, the paper lacks experimental comparison and discussion with other parameter pruning methods, such as [1], which weakens its argument. Notably, [1] achieves a nearly unchanged ViT accuracy (-0.07, 83.

Reviewer 02 · Rating 3 · Confidence 5

Strengths

- The paper does a good job delineating relevant context for neuron alignment and describing their approach.
- Also, the summary of the comparison between compression methods helps understand the trade-off of the merging method.
- Thorough analysis via ablation studies and visualization.

Weaknesses

The biggest weakness is the experimental results. The authors do a great job on the ablation studies and visualization, but these are secondary contributions given that this is a paper on a compression method for Transformer acceleration, not interpretability research. This means the results section should cover a wider range of benchmarks and also comparisons to pruning approaches (which achieve the same end effect as merging). For example, Wanda [1] prunes 50% in one shot (without fin

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This paper explores layer merging, which is an interesting idea. The proposed method of permutation merging provides more capacity to the model after merging. The experiments are conducted on both vision transformer models and language models. Detailed results are provided on merging different numbers of layers and different layer locations.

Weaknesses

Novelty-wise, weight sharing across layers is not a new concept. Early efficient language model designs explored sharing weights across different transformer blocks [1], with later attempts conducted in ViTs and LLMs. Even as a new model compression method, the proposed method does not seem very effective, especially compared to pruning. For example, structural pruning can achieve 2.57x lossless parameter reduction on a ViT model [2], yet the proposed method can only remove 21%. Furthermore,

Reviewer 04 · Rating 3 · Confidence 5

Strengths

- The paper presents a clear and straightforward idea. The authors propose reusing the Feed-Forward Network (FFN) layers in Transformer models, which makes the paper relatively easy to understand. The main novelty comes from averaging FFN weights after applying permutation to align them across layers.
- The proposed method demonstrates that model compression can be achieved by reducing the number of stored parameters through merging FFN layers. This can be useful for reducing memory usage in mod

Weaknesses

- **Limited Practical Use:** The approach only reduces the number of stored parameters without reducing computational cost (no FLOP reduction). This is a significant limitation because many existing compression techniques like pruning aim to reduce both memory and computation, enabling models to run on resource-constrained devices with lower latency. The authors' method, while helpful in reducing memory, doesn't address this more practical need, limiting its applicability.
- **Lack of Comprehens

Code & Models

Repositories

nverma1/merging-ffs-compression
pytorch · Official

Taxonomy

Topics: Advancements in Semiconductor Devices and Circuit Design

Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Vision Transformer · Multi-Head Attention