Data-Free Pruning of Self-Attention Layers in LLMs
Dhananjay Saikumar, Blesson Varghese

TL;DR
This paper introduces Gate-Norm, a fast, data-free method for pruning self-attention sublayers in large language models, yielding up to 1.30x higher inference throughput while keeping average zero-shot accuracy within 2% of the unpruned baseline.
Contribution
Gate-Norm provides a one-shot, weight-only criterion for pruning attention layers without calibration data or fine-tuning, enabling practical compression of LLMs.
Findings
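Pruning 8-16 attention sublayers yields up to 1.30x higher inference throughput.
Average zero-shot accuracy stays within 2% of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA.
Pruning a 40-layer, 13B-parameter LLaMA model takes under a second.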
Abstract
Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query-key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning 8-16 attention sublayers yields up to 1.30x higher inference throughput while keeping average zero-shot accuracy within 2% of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across…
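The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact Gate-Norm formula, so the score used here, the Frobenius norm of W_Q W_K^T, is an assumption chosen only as a plausible weight-only measure of query-key coupling. The function names (`coupling_score`, `select_layers_to_prune`) and the toy weights are hypothetical and are not taken from the paper.

```python
import math
import random


def matmul_abt(a, b):
    """Return a @ b^T for two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def coupling_score(w_q, w_k):
    """Frobenius norm of W_Q W_K^T.

    ASSUMPTION: this is a stand-in for the paper's query-key coupling
    measure; the exact Gate-Norm definition is not stated in the abstract.
    """
    product = matmul_abt(w_q, w_k)
    return math.sqrt(sum(v * v for row in product for v in row))


def select_layers_to_prune(layers, num_to_prune):
    """Rank layers by coupling score and pick the least coupled ones.

    `layers` is a list of (w_q, w_k) pairs, one per attention sublayer.
    Only weights are used: no data, no forward passes.
    Returns (sorted indices to prune, per-layer scores).
    """
    scores = [coupling_score(w_q, w_k) for w_q, w_k in layers]
    ranked = sorted(range(len(layers)), key=lambda i: scores[i])
    return sorted(ranked[:num_to_prune]), scores


def random_matrix(rows, cols, scale, rng):
    """Toy weight matrix with Gaussian entries of the given scale."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy model: six "layers"; layers 2 and 4 have very small query/key weights,
# mimicking sublayers that have learned to suppress their own contribution.
rng = random.Random(0)
scales = [1.0, 0.9, 0.05, 0.8, 0.02, 0.7]
toy_layers = [
    (random_matrix(8, 4, s, rng), random_matrix(8, 4, s, rng)) for s in scales
]

pruned, scores = select_layers_to_prune(toy_layers, num_to_prune=2)
print("layers selected for pruning:", pruned)
```

In this toy setup, the two layers with small query and key weights (indices 2 and 4) receive the lowest scores and are selected. In a real model, a selected attention sublayer would then be skipped so that its block passes the residual stream straight to the MLP; how the paper implements that removal is not covered by this sketch.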
Peer Reviews
Decision: Submitted to ICLR 2026
1. The method's greatest strength is its speed and simplicity. Being data-free, one-shot, and running in milliseconds (~1000x faster than alternatives) makes it incredibly practical for on-the-fly, on-device compression without the massive overhead of data-driven approaches.
2. The entire process requires no calibration datasets, no GPUs for the pruning step itself, and no costly post-pruning fine-tuning. This makes the method highly accessible and easy to deploy.

1. The method focuses exclusively on attention sublayers. The paper itself notes that MLP layers have twice the parameters and also contribute to runtime, but they are not targeted for pruning by this method.
2. While the results support the hypothesis, the introduction doesn't provide direct evidence for "attention suppression" itself (e.g., by showing near-zero output norms from the targeted layers during inference). The claim rests on the method's success.
3. The criterion for pruning is st

1. The Gate-Norm proxy is conceptually simple and easy to implement. It uses only the trained query and key matrices and does not require any calibration data or activation statistics.
2. Experiments demonstrate that pruning 8-16 attention layers yields 1.1-1.3x higher inference throughput while reducing average zero-shot accuracy by at most ~2%. The method thus offers a promising depth-compression strategy for LLM deployment, particularly on devices without GPUs or with strict latency and pri

1. The experiments focus exclusively on two 13B-parameter LLaMA models. It remains unclear whether the Attention Suppression phenomenon and the Gate-Norm proxy generalize to other model families (e.g., Qwen, Mistral, smaller or larger LLaMA variants).
2. The proposed algorithm either keeps or completely disables an attention sublayer. Finer-grained options (e.g., partial gating, head pruning) are not explored.
3. The "no fine-tuning" aspect is a strength for one-shot pruning. However, the paper d

- **Clear and easy to follow.** The paper is generally well structured and readable. The ideas flow logically, and the intuition behind the method is explained in a straightforward way.
- **Simple and data-free approach.** The pruning method doesn't rely on any additional data or fine-tuning, which makes it simple and relatively easy to implement. The computational cost also seems quite low, even for large models.
- **Reasonably practical and scalable.** The approach looks practical enough to be

- **Key Concern.** The method is mainly built on the attention suppression phenomenon, which the authors identify through large-scale empirical analysis (e.g., Fig. 3). But this raises a question: if the paper already measures how cosine similarity and norm changes evolve across layers, why not just use those observed metrics directly to decide which layers to prune? In other words, instead of introducing the Gate-Norm as a new proxy, it might be more straightforward to prune layers that already
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy
