High-Layer Attention Pruning with Rescaling
Songtao Liu, Peng Liu

TL;DR
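This paper introduces a training-free attention head pruning method for large language models that restricts pruning to heads in the higher layers and applies an adaptive rescaling parameter to restore the scale of token representations after pruning. The authors report that it outperforms existing structured pruning methods, most clearly on generation tasks.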
Contribution
The paper proposes a pruning algorithm that removes attention heads only in the model's higher layers, then calibrates the magnitude of the resulting token representations with an adaptive rescaling parameter to limit the accuracy lost to compression.
Findings
Outperforms existing pruning methods across multiple LLMs.
Significantly improves generation task performance.
Effective across diverse datasets and tasks.
Abstract
Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks…
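The two ideas in the abstract, restricting pruning to higher layers and rescaling the representation afterwards, can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the `first_pruned_layer` cutoff, the `heads_to_drop` set (the abstract does not say how heads are selected), and the rescaling rule, which is simply the ratio of mean representation norms before and after pruning on a small calibration set.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def combine_heads(head_outputs, keep_mask):
    """Sum the per-head contributions for one token, skipping pruned heads.

    head_outputs: list of per-head vectors, each already projected to model
    dimension (so the attention output is their sum).
    """
    dim = len(head_outputs[0])
    out = [0.0] * dim
    for head, keep in zip(head_outputs, keep_mask):
        if keep:
            for i, x in enumerate(head):
                out[i] += x
    return out


def higher_layer_masks(num_layers, num_heads, first_pruned_layer, heads_to_drop):
    """Keep all heads below first_pruned_layer; drop heads_to_drop at and above it.

    Both first_pruned_layer and heads_to_drop are hypothetical inputs.
    """
    masks = []
    for layer in range(num_layers):
        if layer < first_pruned_layer:
            masks.append([True] * num_heads)
        else:
            masks.append([h not in heads_to_drop for h in range(num_heads)])
    return masks


def rescale_factor(calib_tokens, keep_mask, eps=1e-8):
    """Assumed rule: mean norm before pruning divided by mean norm after."""
    full_mask = [True] * len(keep_mask)
    full = sum(l2_norm(combine_heads(t, full_mask)) for t in calib_tokens)
    pruned = sum(l2_norm(combine_heads(t, keep_mask)) for t in calib_tokens)
    return full / (pruned + eps)


# Toy example: 4 layers, 2 heads, pruning head 1 in layers 2 and 3 only.
masks = higher_layer_masks(4, 2, first_pruned_layer=2, heads_to_drop={1})

# Two calibration tokens, each with two per-head contributions of dimension 2.
calib = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 2.0], [0.0, 2.0]],
]
alpha = rescale_factor(calib, masks[3])

# Pruned output for the first token, rescaled back toward its original norm.
pruned_out = [alpha * x for x in combine_heads(calib[0], masks[3])]
```

In this toy case dropping one of two identical heads halves the norm, so `alpha` comes out at about 2.0 and the rescaled output recovers the unpruned magnitude. Real head contributions are not identical, so a scalar of this kind can restore only the scale, not the direction, of the representation.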
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
