Pruning Foundation Models for High Accuracy without Retraining
Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin

TL;DR
This paper introduces a novel post-training pruning method for large language models that achieves high accuracy without retraining, reducing model size efficiently while maintaining performance.
Contribution
It formulates a layer-wise pruning problem for LLMs, provides an optimal solution, and designs a pruning algorithm for both unstructured and semi-structured sparsity, outperforming state-of-the-art methods.
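To make the two sparsity patterns concrete, here is a minimal Python sketch. It is not the paper's algorithm: it uses plain magnitude pruning as a stand-in selection rule, and the function names (`nm_magnitude_mask`, `unstructured_magnitude_mask`, `layer_output_error`) are hypothetical. The squared layer-output error it measures is a common way to score layer-wise pruning and is assumed here, since this summary does not state the paper's exact objective.

```python
import numpy as np

def nm_magnitude_mask(weights, n=2, m=4):
    """Semi-structured (N:M) mask: keep the n largest-magnitude weights
    in every group of m consecutive weights along each row.
    Magnitude is an illustrative criterion, not the paper's solution."""
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n largest magnitudes inside each group of m.
    keep = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)

def unstructured_magnitude_mask(weights, sparsity=0.5):
    """Unstructured mask: drop the smallest-magnitude weights anywhere
    in the matrix until the requested fraction is removed."""
    k = int(round(weights.size * sparsity))  # number of weights to remove
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        drop = np.argsort(np.abs(weights), axis=None)[:k]
        mask[drop] = False
    return mask.reshape(weights.shape)

def layer_output_error(weights, mask, inputs):
    """Squared Frobenius error between dense and pruned layer outputs
    (assumed objective: ||W X - (W * mask) X||^2)."""
    diff = weights @ inputs - (weights * mask) @ inputs
    return float(np.sum(diff ** 2))

# Toy layer: 8 output features, 16 input features, 32 calibration samples.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 32))

mask_24 = nm_magnitude_mask(W, n=2, m=4)
mask_un = unstructured_magnitude_mask(W, sparsity=0.5)
err_24 = layer_output_error(W, mask_24, X)
err_un = layer_output_error(W, mask_un, X)
```

Both masks remove half of the weights, but the 2:4 mask must do so within every group of four, which is the constraint that makes semi-structured sparsity friendly to hardware acceleration. Comparing `err_24` and `err_un` on real calibration inputs is one way to see how much accuracy that constraint costs for a given selection rule.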
Findings
Superior performance over SOTA baselines across various LLMs
Effective one-shot pruning without retraining
Maintains high accuracy with reduced model size
Abstract
Despite their superior performance, foundation models and large language models (LLMs) are challenging to deploy because of their massive parameter counts and computational cost. Pruning is a promising technique for reducing model size and accelerating inference, but traditional pruning techniques can hardly be applied to LLMs, as they need to finetune the model on the full dataset for multiple epochs, consuming massive data and hardware resources. To deal with this problem, post-training pruning methods have been proposed to prune LLMs in one shot without retraining. However, their accuracy after pruning may degrade because they lack retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for…
Taxonomy
Topics: Structural Health Monitoring Techniques
Methods: Pruning
