ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models
Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder

TL;DR
ALPS is an optimization-based one-shot pruning framework for large language models. It combines operator splitting with a preconditioned conjugate gradient post-processing step, runs efficiently on GPUs, and improves pruned-model quality over heuristic methods, especially at high sparsity levels.
Contribution
ALPS introduces a novel optimization-based approach for one-shot pruning of large language models, outperforming heuristic methods in achieving higher sparsity and better model performance.
Findings
At 70% sparsity, reduces WikiText test perplexity by 13% relative to existing methods
Outperforms state-of-the-art methods in zero-shot benchmarks
Provides theoretical convergence guarantees for pruning
Abstract
The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms…
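The two ingredients named in the abstract (operator splitting, then a post-processing step on a fixed sparsity pattern) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the common layer-wise objective of minimizing 0.5 * ||X W - X W0||_F^2 subject to W having at most k nonzeros, uses a textbook ADMM splitting W = D with a fixed penalty rho, and replaces the preconditioned conjugate gradient step with a plain per-column least-squares refit. The function names, rho, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def topk_project(M, k):
    """Keep the k largest-magnitude entries of M and zero out the rest."""
    out = np.zeros(M.shape, dtype=M.dtype)
    if k <= 0:
        return out
    if k >= M.size:
        return M.copy()
    idx = np.argpartition(np.abs(M).ravel(), -k)[-k:]
    out.flat[idx] = M.flat[idx]
    return out

def admm_prune(X, W0, k, rho=1.0, iters=100):
    """Textbook ADMM for min 0.5*||XW - XW0||_F^2 s.t. W = D, ||D||_0 <= k.

    Assumption: fixed rho and a dense inverse, which is only
    reasonable for small illustrative problems.
    """
    H = X.T @ X                      # Gram matrix of the layer inputs
    G = H @ W0
    A_inv = np.linalg.inv(H + rho * np.eye(H.shape[0]))
    D = topk_project(W0, k)          # start from magnitude pruning
    V = np.zeros_like(W0)            # dual variable for the constraint W = D
    for _ in range(iters):
        W = A_inv @ (G + rho * D - V)        # quadratic subproblem in W
        D = topk_project(W + V / rho, k)     # projection onto the sparsity set
        V = V + rho * (W - D)                # dual ascent step
    return D

def refit_on_support(X, W0, mask):
    """Re-solve the least-squares problem with the sparsity pattern fixed.

    Stand-in for the paper's preconditioned conjugate gradient step:
    each output column is refit exactly with np.linalg.lstsq.
    """
    Y = X @ W0
    W = np.zeros_like(W0)
    for j in range(W0.shape[1]):
        s = np.flatnonzero(mask[:, j])
        if s.size:
            W[s, j] = np.linalg.lstsq(X[:, s], Y[:, j], rcond=None)[0]
    return W

def recon_error(X, W0, W):
    """Layer-wise reconstruction error 0.5*||XW - XW0||_F^2."""
    return 0.5 * np.linalg.norm(X @ W - X @ W0) ** 2

# Small synthetic layer: 64 calibration rows, 16 inputs, 8 outputs, 70% sparsity.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
W0 = rng.standard_normal((16, 8))
k = int(0.3 * W0.size)
D = admm_prune(X, W0, k)
W_sparse = refit_on_support(X, W0, D != 0)
```

Because the refit solves the least-squares problem exactly on the chosen support, its reconstruction error cannot exceed that of the ADMM iterate with the same support. Whether the ADMM support beats simple magnitude pruning depends on the data and on rho.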
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Pruning
