Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining

Jianwei Li; Yijun Dong; Qi Lei

arXiv:2407.19126·cs.AI·July 30, 2024

Open Access

TL;DR

This paper introduces a single-shot structured pruning method for large language models that requires no retraining. It combines a depth-2 pruning structure with inference-aware criteria to reduce model size while preserving performance.

Contribution

The paper proposes a depth-2 pruning structure, two inference-aware pruning criteria that outperform training-aware metrics such as gradient and Hessian, and a two-step reconstruction technique that mitigates pruning errors without retraining.

Findings

01 Significantly reduces the computational cost and hardware requirements of pruning by avoiding a retraining phase.

02 Maintains high performance across multiple datasets and models.

03 Outperforms traditional training-aware pruning metrics such as gradient and Hessian.

Abstract

To remove redundant components of large language models (LLMs) without incurring significant computational costs, this work focuses on single-shot pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our approach significantly reduces computational costs and hardware requirements while maintaining superior performance across various datasets and models.
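The abstract does not spell out the pruning criteria, so the following is only a minimal sketch of the general idea of inference-aware structured pruning, not the authors' method. It assumes a toy two-layer block `y = W2 · relu(W1 · x)` as a stand-in for a depth-2 structure, and scores each hidden unit by the output energy it contributes on a few calibration inputs. The function names, the ReLU block, and the scoring rule are all illustrative assumptions.

```python
# Illustrative sketch only: the block shape, the scoring rule, and all names
# below are assumptions, not the criteria defined in the paper.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def forward(W1, W2, x):
    # Toy depth-2 block: y = W2 @ relu(W1 @ x).
    return matvec(W2, relu(matvec(W1, x)))

def unit_scores(W1, W2, calib):
    # Hidden unit j adds h_j(x) * W2[:, j] to the output, so its summed
    # squared contribution over the calibration set is
    # (sum_x h_j(x)^2) * ||W2[:, j]||^2.
    hidden = len(W1)
    col_norm_sq = [sum(row[j] ** 2 for row in W2) for j in range(hidden)]
    act_sq = [0.0] * hidden
    for x in calib:
        for j, a in enumerate(relu(matvec(W1, x))):
            act_sq[j] += a * a
    return [act_sq[j] * col_norm_sq[j] for j in range(hidden)]

def prune_units(W1, W2, calib, keep):
    # Keep the `keep` highest-scoring hidden units. Removing unit j drops
    # row j of W1 and column j of W2 together, so the block stays dense.
    scores = unit_scores(W1, W2, calib)
    ranked = sorted(range(len(W1)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    W1_p = [W1[j] for j in kept]
    W2_p = [[row[j] for j in kept] for row in W2]
    return W1_p, W2_p, kept

# Tiny example: hidden unit 1 feeds an all-zero column of W2, so it
# contributes nothing to the output and is the first unit removed.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]]
calib = [[1.0, 2.0], [3.0, 1.0]]
W1_p, W2_p, kept = prune_units(W1, W2, calib, keep=2)  # kept == [0, 2]
```

Because the score depends only on forward activations, it needs no gradients or Hessian information. The sketch makes no attempt to reproduce the two-step reconstruction technique mentioned in the abstract.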

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression

Methods: Pruning