Data-Free Pruning of Self-Attention Layers in LLMs

Dhananjay Saikumar; Blesson Varghese

arXiv:2512.20636·cs.LG·December 25, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Gate-Norm, a fast, data-free method for pruning self-attention layers in large language models. It raises inference throughput by up to 1.30x while keeping average zero-shot accuracy within 2% of the unpruned baseline.

Contribution

Gate-Norm provides a one-shot, weight-only criterion for pruning attention layers without calibration data or fine-tuning, enabling practical compression of LLMs.

Findings

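01. Pruning 8-16 attention layers yields up to 1.30x higher inference throughput.

02. Average zero-shot accuracy stays within 2% of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA.

03. Pruning a 40-layer, 13B-parameter LLaMA model takes under a second.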

Abstract

Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query--key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning $8$--$16$ attention sublayers yields up to $1.30\times$ higher inference throughput while keeping average zero-shot accuracy within $2\%$ of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across…
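The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that attention sublayers are ranked by query-key coupling computed from weights alone, so the specific score used here (the Frobenius norm of $W_Q^\top W_K$), the function names `gate_norm_scores` and `select_layers_to_prune`, and the `(out_features, in_features)` weight layout are all assumptions made for the example.

```python
import numpy as np


def gate_norm_scores(wq_list, wk_list):
    """Return one coupling score per attention sublayer.

    ASSUMPTION: the score is the Frobenius norm of W_Q^T W_K. The source
    only states that the criterion measures query-key coupling from weights.
    Each matrix is assumed to have shape (out_features, in_features).
    """
    scores = []
    for wq, wk in zip(wq_list, wk_list):
        coupling = wq.T @ wk  # (in_features, in_features) bilinear form
        scores.append(np.linalg.norm(coupling, ord="fro"))
    return np.asarray(scores)


def select_layers_to_prune(scores, k):
    """Return the sorted indices of the k least-coupled sublayers."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of layers")
    # A stable sort keeps the choice deterministic when scores tie.
    order = np.argsort(scores, kind="stable")
    return sorted(order[:k].tolist())
```

Given per-layer query and key matrices loaded as NumPy arrays, `select_layers_to_prune(gate_norm_scores(wq_list, wk_list), k=8)` would return the indices of the eight lowest-scoring sublayers. Under the hypothesis stated in the abstract, removing a sublayer would then presumably amount to skipping its attention branch so that the residual stream and the MLP carry the representation. No data or forward pass is involved at any point, which is consistent with the sub-second pruning time reported.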

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The method's greatest strength is its speed and simplicity. Being data-free, one-shot, and running in milliseconds (~1000x faster than alternatives) makes it incredibly practical for on-the-fly, on-device compression without the massive overhead of data-driven approaches.
2. The entire process requires no calibration datasets, no GPUs for the pruning step itself, and no costly post-pruning fine-tuning. This makes the method highly accessible and easy to deploy.

Weaknesses

1. The method focuses exclusively on attention sublayers. The paper itself notes that MLP layers have twice the parameters and also contribute to runtime, but they are not targeted for pruning by this method.
2. While the results support the hypothesis, the introduction doesn't provide direct evidence for "attention suppression" itself (e.g., by showing near-zero output norms from the targeted layers during inference). The claim rests on the method's success.
3. The criterion for pruning is st

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The Gate-Norm proxy is conceptually simple and easy to implement. It uses only the trained query and key matrices and does not require any calibration data or activation statistics.
2. Experiments demonstrate that pruning 8–16 attention layers yields 1.1–1.3× higher inference throughput while reducing average zero-shot accuracy by at most ~2%. The method thus offers a promising depth-compression strategy for LLM deployment, particularly on devices without GPUs or with strict latency and pri

Weaknesses

1. The experiments focus exclusively on two 13B-parameter LLaMA models. It remains unclear whether the Attention Suppression phenomenon and the Gate-Norm proxy generalize to other model families (e.g., Qwen, Mistral, smaller or larger LLaMA variants).
2. The proposed algorithm either keeps or completely disables an attention sublayer. Finer-grained options (e.g., partial gating, head pruning) are not explored.
3. The "no fine-tuning" aspect is a strength for one-shot pruning. However, the paper d

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- **Clear and easy to follow.** The paper is generally well structured and readable. The ideas flow logically, and the intuition behind the method is explained in a straightforward way.
- **Simple and data-free approach.** The pruning method doesn’t rely on any additional data or fine-tuning, which makes it simple and relatively easy to implement. The computational cost also seems quite low, even for large models.
- **Reasonably practical and scalable.** The approach looks practical enough to be

Weaknesses

- **Key Concern.** The method is mainly built on the attention suppression phenomenon, which the authors identify through large-scale empirical analysis (e.g., Fig. 3). But this raises a question: if the paper already measures how cosine similarity and norm changes evolve across layers, why not just use those observed metrics directly to decide which layers to prune? In other words, instead of introducing the Gate-Norm as a new proxy, it might be more straightforward to prune layers that already

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy