What Matters in Transformers? Not All Attention is Needed
Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

TL;DR
This paper investigates redundancy in Transformer models, revealing that many attention layers can be pruned without significant performance loss, leading to more efficient architectures.
Contribution
It introduces a similarity-based metric to analyze redundancy across Transformer modules and proposes a joint layer dropping method for improved efficiency.
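As a rough illustration of what a similarity-based redundancy metric can look like, the sketch below scores a module by how little it changes its input. It assumes the metric is cosine similarity between a module's input and output hidden states (the reviews below describe the measure as cosine similarity). The function names, the `1 - similarity` form of the score, and the toy vectors are assumptions made for illustration, not the authors' code.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def importance_score(inputs, outputs):
    """Hypothetical importance score: 1 minus the mean input/output similarity.

    `inputs` and `outputs` hold one hidden-state vector per token. A score
    near 0 means the module barely changes the direction of its input, so
    it is a candidate for removal.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)

# Toy hidden states for two tokens (made-up numbers).
inputs = [[1.0, 0.0], [0.0, 2.0]]
near_identity = [[2.0, 0.0], [0.0, 1.0]]  # same directions, different norms
rotating = [[0.0, 1.0], [1.0, 0.0]]       # orthogonal to the inputs

print(importance_score(inputs, near_identity))
print(importance_score(inputs, rotating))
```

The `near_identity` outputs differ from the inputs only in magnitude, yet they receive the lowest possible score. A score built on cosine similarity depends on direction alone.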
Findings
High redundancy in attention layers allows many of them to be pruned without significant performance loss.
On Llama-2-70B, pruning half of the attention layers yields a 48.4% speedup with only a 2.4% performance drop.
Joint dropping of attention and MLP layers can retain 90% performance while removing 31 layers.
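One plausible reading of joint layer dropping is to pool the per-layer importance scores of the attention and MLP layers and remove the lowest-ranked ones regardless of type. The sketch below shows that selection step only. The function `select_layers_to_drop`, the pooling rule, and all score values are hypothetical and are not taken from the paper.

```python
def select_layers_to_drop(attn_scores, mlp_scores, num_drop):
    """Return the `num_drop` lowest-scoring layers across both module types.

    Each score list is indexed by layer position. The result is a list of
    (kind, layer_index) pairs ordered from least to most important.
    """
    candidates = [(score, "attn", i) for i, score in enumerate(attn_scores)]
    candidates += [(score, "mlp", i) for i, score in enumerate(mlp_scores)]
    # Sort on the score only; Python's sort is stable, so ties keep pool order.
    candidates.sort(key=lambda c: c[0])
    return [(kind, idx) for _, kind, idx in candidates[:num_drop]]

# Made-up importance scores for a 4-layer model (lower means more redundant).
attn_scores = [0.30, 0.05, 0.02, 0.01]
mlp_scores = [0.40, 0.25, 0.20, 0.03]

dropped = select_layers_to_drop(attn_scores, mlp_scores, 3)
print(dropped)  # [('attn', 3), ('attn', 2), ('mlp', 3)]
```

With these toy scores the selection is dominated by attention layers, with one MLP layer included. A shared ranking can produce this kind of mix when attention layers tend to score lower.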
Abstract
While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is well-organized and easy to follow.
- The study on Attention redundancy is interesting. The proposed approach is reasonable.
- The results show that Attention layers and last layers have more redundancy. However, the experiments are mainly conducted on Llama and Mistral. Is this finding valid for other LLMs? It would be interesting to show results on more LLMs.
- Please further clarify how the dropped layers are selected in the pruning process. If the target is to drop 4 layers, are the attention layers tested one by one to find the 4 layers with the lowest importance scores?
1. The paper proposes a detailed method for Attention and MLP pruning.
2. The paper is well-written and easy to understand.
1. The contribution of this paper is limited. There are many previous similar methods in layer pruning [1][2][3], and this paper is a simple extension to the MLP and attention layers.
2. The authors did not analyze the reason for the layer redundancy. Although extensive experiments are provided, the reason why transformer-based LLMs exhibit redundancy in the MLP and Attention layers is not explained.
3. The authors did not compare the experimental performance with previous block or layer pruning
- The paper is clearly written and well organized. It provides a systematic exploration of redundancy in Transformers, focusing on Blocks, MLP, and Attention layers.
- This paper provides several useful insights, for example:
  - FFNs seem more important, and Attention modules can be dropped with minimal performance impact and high efficiency gains.
  - Deeper layers seem less important compared to the shallower ones, which indicates the model has obtained answers in early layers.
- The findings in this p
- All experiments in this paper are conducted on a group of datasets; however, these datasets are still limited and cannot represent real-world applications or validate the generalization ability of the pruned model. For example, if the input sequence is not short, early layer attention modules can model the token-wise relationships and predict correct answers, but long sequence tasks such as needle in a haystack might be seriously affected by the dropping.
- In addition, the importance scor
The paper is well written and easy to follow, and the claims are backed by empirical results, latency measurements, and visual insights. The method is simple and looks easy enough to implement and replicate.
The novelty of the paper is limited. The paper's main findings (that attention is more easily pruned than MLP, and that shallow layers are more easily pruned than first and last layers) are known; see, for example, "A deeper look at depth pruning of LLMs" (https://arxiv.org/pdf/2407.16286). The importance measure uses the scale-invariant cosine similarity. It could be argued that this fails to capture magnitude information. Since the cosine similarity measure only depends on the orien
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Topic Modeling
Methods: Softmax · Attention Is All You Need · Pruning
