Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
Jianwei Li, Yijun Dong, Qi Lei

TL;DR
This paper introduces a single-shot structured pruning method for large language models that requires no retraining. It combines a depth-2 pruning structure, inference-aware pruning criteria, and a two-step reconstruction technique to reduce model size while preserving performance.
Contribution
It proposes a depth-2 pruning structure and two inference-aware criteria for efficient, retraining-free structured pruning of LLMs; the criteria outperform traditional training-aware metrics such as gradient and Hessian.
Findings
Significantly reduces computational costs and hardware needs.
Maintains high performance across multiple datasets and models.
Outperforms traditional training-aware pruning metrics.
Abstract
To remove redundant components of large language models (LLMs) without incurring significant computational costs, this work focuses on single-shot pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our approach significantly reduces computational costs and hardware requirements while maintaining superior performance across various datasets and models.
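To make the idea of pruning by output approximation concrete, here is a minimal NumPy sketch. It is not the paper's method: it assumes a depth-2 module means two linear maps coupled through a shared hidden dimension (y = W2 · relu(W1 · x)), it uses a generic illustrative score (the norm of each hidden unit's contribution to the output on calibration data) rather than the paper's two criteria, and it replaces the paper's two-step reconstruction with a plain least-squares refit of the second layer. The function name `prune_depth2_block` and all parameters are hypothetical.

```python
import numpy as np

def prune_depth2_block(W1, W2, X, keep):
    """Prune hidden units of y = W2 @ relu(W1 @ x) without retraining.

    W1: (d_hidden, d_in), W2: (d_out, d_hidden), X: (d_in, n) calibration inputs.
    Returns pruned W1, pruned W2, a least-squares refit of W2, and kept indices.
    """
    H = np.maximum(W1 @ X, 0.0)   # hidden activations, shape (d_hidden, n)
    Y = W2 @ H                    # dense output that the pruned block should approximate

    # Illustrative score (an assumption, not the paper's criterion): unit j adds
    # the rank-one term W2[:, j] H[j, :] to Y, whose Frobenius norm is
    # ||W2[:, j]|| * ||H[j, :]||.
    scores = np.linalg.norm(W2, axis=0) * np.linalg.norm(H, axis=1)
    idx = np.sort(np.argsort(scores)[-keep:])   # keep the highest-scoring units

    # Structured pruning: drop matching rows of W1 and columns of W2.
    W1_p, W2_p = W1[idx], W2[:, idx]

    # Generic reconstruction step: refit W2 on the kept activations by solving
    # min_Z ||H_k.T @ Z - Y.T||_F and transposing the solution.
    H_k = H[idx]
    W2_r = np.linalg.lstsq(H_k.T, Y.T, rcond=None)[0].T
    return W1_p, W2_p, W2_r, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(16, 8))
    W2 = rng.normal(size=(4, 16))
    X = rng.normal(size=(8, 64))
    W1_p, W2_p, W2_r, idx = prune_depth2_block(W1, W2, X, keep=8)
    H = np.maximum(W1 @ X, 0.0)
    Y = W2 @ H
    print("error without refit:", np.linalg.norm(W2_p @ H[idx] - Y))
    print("error with refit:   ", np.linalg.norm(W2_r @ H[idx] - Y))
```

Because the refit is the least-squares optimum on the calibration data, its approximation error there can never exceed that of the unadjusted pruned weights; how well it transfers to unseen inputs depends on the calibration set.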
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression
Methods: Pruning
