High-Layer Attention Pruning with Rescaling

Songtao Liu; Peng Liu

arXiv:2507.01900·cs.CL·January 28, 2026

TL;DR

This paper introduces an attention head pruning method for large language models that prunes heads in the model's higher layers and applies an adaptive rescaling parameter to restore the scale of token representations after pruning. The authors report that it outperforms existing pruning methods, with the largest gains on generation tasks.

Contribution

The paper proposes a pruning algorithm that targets attention heads in the model's higher layers and adds an adaptive rescaling parameter that recalibrates the scale of token representations after pruning.

Findings

01 Outperforms existing pruning methods across multiple LLMs.

02 Significantly improves generation task performance.

03 Effective across diverse datasets and tasks.

Abstract

Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks…
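
The two ideas in the abstract, pruning heads only in higher layers and rescaling the result, can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not say how heads are selected or how the rescaling parameter is computed. The sketch assumes hypothetical per-head importance scores, treats a layer's attention output as the sum of per-head contributions after the output projection, and uses a simple norm ratio as a stand-in for the rescaling parameter. All function names and parameters are illustrative.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def high_layer_mask(scores, first_pruned_layer, n_prune):
    """Build a keep-mask per layer.

    scores[layer][head] are hypothetical importance scores (assumption).
    Layers below first_pruned_layer are left intact; in each higher layer
    the n_prune lowest-scoring heads are dropped.
    """
    masks = []
    for layer, layer_scores in enumerate(scores):
        keep = [True] * len(layer_scores)
        if layer >= first_pruned_layer:
            order = sorted(range(len(layer_scores)), key=lambda h: layer_scores[h])
            for h in order[:n_prune]:
                keep[h] = False
        masks.append(keep)
    return masks


def combine_heads(head_outputs, keep_mask):
    """Sum per-head contribution vectors, skipping pruned heads."""
    out = [0.0] * len(head_outputs[0])
    for head, keep in zip(head_outputs, keep_mask):
        if keep:
            for i, x in enumerate(head):
                out[i] += x
    return out


def rescale_factor(head_outputs, keep_mask):
    """Stand-in rescaling parameter (assumption): ratio of the unpruned
    output norm to the pruned output norm, so that multiplying the pruned
    output by it restores the original magnitude."""
    full = l2_norm(combine_heads(head_outputs, [True] * len(head_outputs)))
    pruned = l2_norm(combine_heads(head_outputs, keep_mask))
    return full / pruned if pruned > 0 else 1.0


# Toy example: one layer with two heads in a 2-dimensional space.
heads = [[3.0, 0.0], [0.0, 4.0]]
mask = [True, False]  # second head pruned
alpha = rescale_factor(heads, mask)
rescaled = [alpha * x for x in combine_heads(heads, mask)]
```

In the toy example, removing the second head shrinks the output norm, and multiplying by `alpha` brings it back to the unpruned norm. The direction of the vector still changes, which is why rescaling can only compensate for magnitude, not for the information carried by the removed head.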

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare