BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

TL;DR
BESA is a novel pruning method for large language models that allocates sparsity across transformer blocks to minimize performance loss, enabling efficient pruning of models like LLaMA with high accuracy and speed.
Contribution
Introduces BESA, a blockwise, differentiable sparsity allocation method that improves pruning efficiency and reduces performance degradation in large language models.
Findings
BESA achieves state-of-the-art pruning performance on LLaMA models.
It prunes models from 7B to 70B parameters within five hours on a single GPU.
The method maintains high model accuracy post-pruning.
Abstract
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it…
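To make the contrast between a per-layer pruning rate and a blockwise reconstruction loss concrete, here is a minimal toy sketch in plain Python. It is not BESA, SparseGPT, or Wanda: magnitude pruning is swapped in as a simple stand-in criterion, the "block" is just two linear layers with a ReLU between them, and all function names (`magnitude_prune`, `block`, `blockwise_error`) are hypothetical. It only evaluates fixed sparsity allocations by the error they cause at the block output; it does not learn the allocation.

```python
import random

def magnitude_prune(weights, sparsity):
    """Return a copy of a 2D weight list with the smallest-magnitude
    `sparsity` fraction of entries set to zero (stand-in criterion)."""
    cols = len(weights[0])
    order = sorted(range(len(weights) * cols),
                   key=lambda i: abs(weights[i // cols][i % cols]))
    k = int(sparsity * len(order))  # number of weights to remove
    pruned = [row[:] for row in weights]
    for i in order[:k]:
        pruned[i // cols][i % cols] = 0.0
    return pruned

def linear(weights, x):
    """Matrix-vector product for a 2D weight list and a 1D input list."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def block(layers, x):
    """Toy block: linear layers with a ReLU between consecutive layers."""
    for idx, w in enumerate(layers):
        x = linear(w, x)
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def blockwise_error(dense_layers, sparsities, inputs):
    """Mean squared error between dense and pruned block outputs,
    given one sparsity ratio per layer."""
    pruned = [magnitude_prune(w, s) for w, s in zip(dense_layers, sparsities)]
    total = 0.0
    for x in inputs:
        ref = block(dense_layers, x)
        out = block(pruned, x)
        total += sum((a - b) ** 2 for a, b in zip(ref, out))
    return total / len(inputs)

# Toy data: two 8x8 layers and a handful of random calibration inputs.
random.seed(0)
layers = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(8)]
          for _ in range(2)]
calib = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]

# Two allocations with the same average sparsity (50%).
for alloc in [(0.5, 0.5), (0.3, 0.7), (0.7, 0.3)]:
    print(alloc, blockwise_error(layers, alloc, calib))
```

Allocations with the same average sparsity generally produce different errors at the block output, which is the quantity a blockwise objective measures; a layer-wise method with a fixed pruning rate never compares them.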
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Pruning
