What Matters in Transformers? Not All Attention is Needed

Shwai He; Guoheng Sun; Zheyu Shen; Ang Li

arXiv:2406.15786·cs.LG·October 18, 2024·3 cites

TL;DR

This paper investigates redundancy in Transformer models, revealing that many attention layers can be pruned without significant performance loss, leading to more efficient architectures.

Contribution

It introduces a similarity-based metric to analyze redundancy across Transformer modules and proposes a joint layer dropping method for improved efficiency.
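To make the idea of a similarity-based redundancy metric concrete, here is a minimal sketch. It assumes the importance of a module is one minus the average cosine similarity between the hidden states entering and leaving it; the page itself does not give the formula, so the function names (`cosine_similarity`, `importance_score`) and the toy vectors are illustrative assumptions, not the authors' code.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def importance_score(inputs, outputs):
    """Assumed score: 1 - mean per-token cosine similarity of input vs. output.

    A score near 0 means the module barely changes the direction of its
    input (a candidate for removal); a larger score means it transforms
    the hidden state substantially.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


# Toy hidden states for two tokens (hypothetical values).
hidden_in = [[1.0, 0.0], [0.0, 1.0]]
almost_unchanged = [[1.0, 0.01], [0.01, 1.0]]  # output nearly equals input
fully_rotated = [[0.0, 1.0], [1.0, 0.0]]       # output orthogonal to input

redundant_score = importance_score(hidden_in, almost_unchanged)
important_score = importance_score(hidden_in, fully_rotated)
print(redundant_score, important_score)
```

In this toy setup the module whose output nearly equals its input gets a score close to zero, while the module that rotates every token by 90 degrees gets the maximum score of one.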

Findings

01

High redundancy in attention layers allows many of them to be pruned without significant performance loss.

02

On Llama-2-70B, pruning half of the attention layers yields a 48.4% speedup with only a 2.4% performance drop.

03

Jointly dropping attention and MLP layers can retain 90% of performance while removing 31 layers.
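A joint dropping step can be sketched as a single ranking over attention and MLP layers together. The sketch below assumes each layer already has an importance score and that the lowest-scoring layers are removed in one shot; the selection rule, the function name `select_layers_to_drop`, and the scores are hypothetical and are not taken from the paper.

```python
def select_layers_to_drop(scores, num_to_drop):
    """Return the `num_to_drop` layers with the lowest importance scores.

    `scores` maps a (module_type, layer_index) pair to an importance score.
    Attention and MLP layers share one ranking, so the budget is spent on
    whichever modules look most redundant regardless of their type.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [layer for layer, _ in ranked[:num_to_drop]]


# Hypothetical scores for a 3-block model (lower = more redundant).
scores = {
    ("attn", 0): 0.30,
    ("mlp", 0): 0.55,
    ("attn", 1): 0.04,
    ("mlp", 1): 0.40,
    ("attn", 2): 0.02,
    ("mlp", 2): 0.10,
}

dropped = select_layers_to_drop(scores, 3)
print(dropped)
```

With these made-up scores the selection is dominated by attention layers, with one MLP layer filling the remaining slot, which mirrors the kind of mixed selection a joint ranking allows.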

Abstract

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper is well-organized and easy to follow.
- The study on Attention redundancy is interesting. The proposed approach is reasonable.

Weaknesses

- The results show that Attention layers and last layers have more redundancy. However, the experiments are mainly conducted on Llama and Mistral. Is this finding valid for other LLMs? It would be interesting to show more LLMs.
- Please further clarify how the dropped layers are selected in the pruning process. If the target is to drop 4 layers, are the attention layers tested one by one to find the 4 layers with the lowest importance scores?

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The paper proposed a detailed method for Attention and MLP pruning.
2. The paper is well-written and easy to understand.

Weaknesses

1. The contribution of this paper is limited. There are many previous similar methods in layer pruning [1][2][3], and this paper is a simple extension to the MLP and attention layers.
2. The author did not analyze the reason for the layer redundancy. Although extensive experiments are provided, the reason why transformer-based LLMs exhibit redundancy in the MLP and Attention layers is not explained.
3. The author did not compare the experimental performance with previous block or layer pruning

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The paper is clearly written and well organized; it provides a systematic exploration of redundancy in Transformers, focusing on Blocks, MLP, and Attention layers.
- This paper provides several useful insights, for example:
  - FFNs seem more important, and Attention modules can be dropped with minimal performance impact and with high efficiency.
  - Deeper layers seem less important compared to the shallower ones, which indicates the model has obtained answers in early layers.
- The findings in this p

Weaknesses

- All experiments in this paper are conducted on a group of datasets; however, these datasets are still limited and cannot represent real-world applications or validate the generalization ability of the pruned model. For example, if the input sequence is not short, early layer attention modules can model the token-wise relationships and predict correct answers, but long sequence tasks such as needle in a haystack might be seriously affected by the dropping.
- In addition, the importance scor

Reviewer 04 · Rating 6 · Confidence 4

Strengths

The paper is well written, it is easy to follow and the claims are backed with empirical results, latency measurements and visual insights. The method is simple and looks easy enough to implement and replicate.

Weaknesses

The novelty of the paper is limited. The paper's main findings (that attention is more easily pruned than MLP, and that shallow layers are more easily pruned than first and last layers) are known, see for example "A deeper look at depth pruning of LLMs" (https://arxiv.org/pdf/2407.16286). The importance measure is using the scale-invariant cosine similarity measure. It could be argued that this fails to capture magnitude information. Since the cosine similarity measure only depends on the orien
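The reviewer's concern about scale invariance can be illustrated with a toy example. The sketch below assumes a hypothetical module that multiplies its input by 10: cosine similarity reports the output as identical in direction to the input, even though the vector has changed substantially in magnitude. The vectors and the helper names `cosine_similarity` and `l2_norm` are illustrative assumptions.

```python
import math


def l2_norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in x))


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (l2_norm(x) * l2_norm(y))


# Hypothetical module that only rescales its input by a factor of 10.
hidden_in = [1.0, 2.0, 3.0]
hidden_out = [10.0 * v for v in hidden_in]

similarity = cosine_similarity(hidden_in, hidden_out)

# Relative change in the hidden state, which does see the rescaling.
difference = [o - i for i, o in zip(hidden_in, hidden_out)]
relative_change = l2_norm(difference) / l2_norm(hidden_in)

print(similarity, relative_change)
```

A purely cosine-based score would rate this module as maximally redundant, whereas a magnitude-aware measure such as the relative change would not. Whether this matters in practice depends on how normalization layers in the model treat such rescaling, which the toy example does not address.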

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Topic Modeling

Methods: Softmax · Attention Is All You Need · Pruning