BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

Peng Xu; Wenqi Shao; Mengzhao Chen; Shitao Tang; Kaipeng Zhang; Peng Gao; Fengwei An; Yu Qiao; Ping Luo

arXiv:2402.16880 · cs.LG · April 22, 2024 · 3 cites

TL;DR

BESA is a novel pruning method for large language models that allocates sparsity across transformer blocks to minimize performance loss, enabling efficient pruning of models like LLaMA with high accuracy and speed.

Contribution

Introduces BESA, a blockwise, differentiable sparsity allocation method that improves pruning efficiency and reduces performance degradation in large language models.

Findings

01. BESA achieves state-of-the-art pruning performance on LLaMA models.

02. It prunes models with 7B to 70B parameters within five hours on a single GPU.

03. The method maintains high model accuracy post-pruning.

Abstract

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous tuning of hyperparameters such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA), which applies a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it…
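
To make two terms from the abstract concrete, the sketch below prunes a single weight matrix at a fixed pruning rate and measures how much the layer's output changes. It is not BESA, SparseGPT, or Wanda: it uses plain magnitude pruning as a stand-in, and the names `magnitude_prune`, `matvec`, and `output_error` are hypothetical helpers introduced only for illustration.

```python
import random


def magnitude_prune(weights, sparsity):
    """Return a copy of a 2-D weight list with the smallest-magnitude
    fraction `sparsity` of entries set to zero (assumed stand-in criterion)."""
    cols = len(weights[0])
    # Rank every (row, col) position by the absolute value of its weight.
    order = sorted(
        ((r, c) for r in range(len(weights)) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    k = int(len(order) * sparsity)  # number of weights to remove
    pruned = [row[:] for row in weights]
    for r, c in order[:k]:
        pruned[r][c] = 0.0
    return pruned


def matvec(weights, x):
    """Multiply a 2-D weight list by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_error(weights, pruned, inputs):
    """Mean squared L2 distance between dense and pruned layer outputs."""
    total = 0.0
    for x in inputs:
        dense, sparse = matvec(weights, x), matvec(pruned, x)
        total += sum((d - s) ** 2 for d, s in zip(dense, sparse))
    return total / len(inputs)


# Toy layer and calibration inputs (random values, for illustration only).
random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
X = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

for rate in (0.25, 0.5, 0.75):
    err = output_error(W, magnitude_prune(W, rate), X)
    print(f"pruning rate {rate:.2f}: output error {err:.4f}")
```

Here the pruning rate is a hand-picked constant for the one layer, and the error is measured at that layer's output only. According to the abstract, BESA instead applies a reconstruction loss at the level of a transformer block.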

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Pruning