Merging Feed-Forward Sublayers for Compressed Transformers
Neha Verma, Kenton Murray, Kevin Duh

TL;DR
This paper introduces a novel method for compressing Transformer models by merging similar feed-forward sublayers, reducing parameters significantly while maintaining high performance across various tasks.
Contribution
The paper proposes a new approach to model compression by merging feed-forward sublayers, achieving high compression rates with minimal performance loss.
Findings
Merged over a third of feed-forward sublayers in models.
Maintained 99% of original performance after 21% parameter reduction.
Outperformed layer-pruning baseline in experiments.
Abstract
With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of…
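The select, align, and merge idea can be illustrated with a toy sketch. This is not the authors' implementation: the summary here does not specify how alignment is computed, so the example assumes bias-free ReLU sublayers, matches hidden units by activation correlation, and uses a simple greedy matching (a stand-in chosen for brevity). All names (`ffn`, `greedy_alignment`, `merge_aligned`) and dimensions are hypothetical.

```python
import numpy as np

def hidden(x, w_in):
    """Hidden activations of a bias-free ReLU feed-forward sublayer."""
    return np.maximum(x @ w_in, 0.0)

def ffn(x, w_in, w_out):
    """Feed-forward sublayer output: relu(x @ w_in) @ w_out."""
    return hidden(x, w_in) @ w_out

def greedy_alignment(h_a, h_b):
    """Match each hidden unit of A to one unit of B by activation correlation.

    Assumption: greedy matching is used for simplicity; an optimal
    assignment solver could be substituted.
    """
    n_units = h_a.shape[1]
    z_a = (h_a - h_a.mean(axis=0)) / (h_a.std(axis=0) + 1e-8)
    z_b = (h_b - h_b.mean(axis=0)) / (h_b.std(axis=0) + 1e-8)
    corr = z_a.T @ z_b / h_a.shape[0]  # corr[i, j]: unit i of A vs unit j of B
    perm = np.full(n_units, -1, dtype=int)
    taken = np.zeros(n_units, dtype=bool)
    for flat in np.argsort(-corr, axis=None):  # highest correlation first
        i, j = divmod(int(flat), n_units)
        if perm[i] < 0 and not taken[j]:
            perm[i] = j
            taken[j] = True
    return perm

def merge_aligned(w_in_a, w_out_a, w_in_b, w_out_b, x):
    """Permute B's hidden units to line up with A's, then average the weights."""
    perm = greedy_alignment(hidden(x, w_in_a), hidden(x, w_in_b))
    # Permuting columns of w_in and rows of w_out together leaves B's function unchanged.
    w_in_b, w_out_b = w_in_b[:, perm], w_out_b[perm, :]
    return (w_in_a + w_in_b) / 2, (w_out_a + w_out_b) / 2, perm

# Toy data: layer B is a shuffled, slightly perturbed copy of layer A.
rng = np.random.default_rng(0)
d_model, d_ff, n_samples = 8, 16, 512
x = rng.normal(size=(n_samples, d_model))
w_in_a = rng.normal(size=(d_model, d_ff))
w_out_a = rng.normal(size=(d_ff, d_model))
shuffle = rng.permutation(d_ff)
w_in_b = w_in_a[:, shuffle] + 0.01 * rng.normal(size=(d_model, d_ff))
w_out_b = w_out_a[shuffle, :] + 0.01 * rng.normal(size=(d_ff, d_model))

w_in_m, w_out_m, perm = merge_aligned(w_in_a, w_out_a, w_in_b, w_out_b, x)
reference = ffn(x, w_in_a, w_out_a)
err_aligned = np.abs(ffn(x, w_in_m, w_out_m) - reference).mean()
err_naive = np.abs(ffn(x, (w_in_a + w_in_b) / 2, (w_out_a + w_out_b) / 2) - reference).mean()
print(f"aligned merge error: {err_aligned:.4f}, naive average error: {err_naive:.4f}")
```

In this toy setting the two sublayers differ mostly by the ordering of their hidden units, so averaging without alignment mixes unrelated units, while averaging after alignment stays close to the original sublayer. Real sublayers from different depths will not match this cleanly, which is presumably why fine-tuning after merging is discussed in the reviews.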
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper is well-written with a clear and logical structure. The paper presents a novel method to reduce the storage costs of deep Transformer-based models by merging their parameters. The experimental results provide a detailed discussion of parameter merging across multiple deep models on various tasks, demonstrating the effectiveness of the proposed approach.
As shown in Table 1, parameter merging maintains the model's inference speed but still requires fine-tuning, highlighting the drawbacks of this approach. Although distinct from parameter pruning, parameter merging/sharing remains a common model compression technique. However, the paper lacks experimental comparison and discussion with other parameter pruning methods, such as [1], which weakens the argument presented in this paper. Notably, [1] achieves a nearly unchanged ViT accuracy (-0.07, 83.
- The paper does a good job delineating relevant context for neuron alignment and describing their approach.
- Also, the summary of the comparison between compression methods helps understand the trade-off of the merging method.
- Thorough analysis via ablation studies and visualization.
The biggest weakness is the experimental results. The authors do a great job with the ablation studies and visualization, but these are secondary contributions given that this is a paper on a compression method for Transformer acceleration, not interpretability research. This means the results section should cover a wider range of benchmarks and also comparisons to pruning approaches (which achieve the same end effect as merging). For example, Wanda [1] prunes 50% in one shot (without fin
This paper explores layer merging, which is an interesting idea. The proposed method of permutation merging provides more capacity to the model after merging. The experiments are conducted on both vision transformer models and language models. Detailed results are provided on merging different numbers of layers and different layer locations.
Novelty-wise, weight sharing across layers is not a new concept. Early efficient language model designs explored sharing weights across different transformer blocks [1], with later attempts conducted in ViTs and LLMs. Even as a new model compression method, the proposed method seems to be not very effective, especially compared to pruning. For example, structural pruning can achieve 2.57x lossless parameter reduction on a ViT model [2], yet the proposed method can only remove 21%. Furthermore,
- The paper presents a clear and straightforward idea. The authors propose reusing the Feed-Forward Network (FFN) layers in Transformer models, which makes the paper relatively easy to understand. The main novelty comes from averaging FFN weights after applying permutation to align them across layers.
- The proposed method demonstrates that model compression can be achieved by reducing the number of stored parameters through merging FFN layers. This can be useful for reducing memory usage in mod
- **Limited Practical Use:** The approach only reduces the number of stored parameters without reducing computational cost (no FLOP reduction). This is a significant limitation because many existing compression techniques like pruning aim to reduce both memory and computation, enabling models to run on resource-constrained devices with lower latency. The authors' method, while helpful in reducing memory, doesn't address this more practical need, limiting its applicability.
- **Lack of Comprehens
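The memory-versus-compute point raised in this review can be made concrete with a small accounting sketch. It assumes that merged layers share a single stored weight set while every layer still runs its feed-forward sublayer at inference. The 12-layer configuration, the dimensions, and the helper names are hypothetical and are not taken from the paper.

```python
def ffn_weights(d_model, d_ff):
    """Weights in one bias-free feed-forward sublayer (two linear maps)."""
    return 2 * d_model * d_ff

def stack_cost(weight_ids, d_model, d_ff):
    """Return (stored FFN weights, FFN multiply-accumulates per token).

    weight_ids[i] names the weight set used by layer i; layers that were
    merged share the same id.
    """
    per_sublayer = ffn_weights(d_model, d_ff)
    stored = len(set(weight_ids)) * per_sublayer   # shared sets are stored once
    macs = len(weight_ids) * per_sublayer          # every layer still runs its FFN
    return stored, macs

# Hypothetical 12-layer model; layers 3 through 7 share one merged FFN.
original = list(range(12))
merged = [0, 1, 2, 3, 3, 3, 3, 3, 8, 9, 10, 11]

stored_before, macs_before = stack_cost(original, d_model=768, d_ff=3072)
stored_after, macs_after = stack_cost(merged, d_model=768, d_ff=3072)

print("stored FFN weights:", stored_before, "->", stored_after)
print("FFN MACs per token:", macs_before, "->", macs_after)
```

Under these assumptions, the stored feed-forward weights drop by a third while the per-token multiply-accumulate count is unchanged, which is the trade-off the reviewer describes.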
Taxonomy
Topics: Advancements in Semiconductor Devices and Circuit Design
Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Vision Transformer · Multi-Head Attention
