QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
Changhai Zhou, Yuhua Zhou, Shijie Han, Qian Qiao, and Hongguang Li

TL;DR
QPruner introduces a layer-wise mixed-precision quantization framework combined with structured pruning and Bayesian optimization to reduce memory usage in large language models while preserving or enhancing accuracy.
Contribution
It presents a novel approach integrating quantization with structured pruning and Bayesian optimization for improved memory efficiency in LLMs.
Findings
Significant memory savings compared to existing methods.
Maintains or improves model accuracy after pruning and quantization.
Effective layer importance estimation for precision allocation.
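As a rough illustration of how layer importance could drive precision allocation, the sketch below gives the most important layers a higher bit-width until a memory budget is exhausted. It is a simple greedy stand-in, not the Bayesian optimization used by QPruner. The function names, the 4-bit and 8-bit choices, the importance scores, and the budget are all hypothetical and chosen only for illustration.

```python
from typing import List


def layer_memory_bytes(num_params: int, bits: int) -> float:
    """Weight storage for one layer at a given bit-width.

    Ignores quantization metadata such as scales and zero points.
    """
    return num_params * bits / 8


def allocate_precision(
    importance: List[float],
    num_params: List[int],
    budget_bytes: float,
    low_bits: int = 4,   # assumed low-precision setting
    high_bits: int = 8,  # assumed high-precision setting
) -> List[int]:
    """Greedy mixed-precision allocation (illustrative only).

    Every layer starts at low_bits. Layers are then visited from most to
    least important, and a layer is raised to high_bits whenever the extra
    memory still fits within budget_bytes.
    """
    if len(importance) != len(num_params):
        raise ValueError("importance and num_params must have the same length")

    bits = [low_bits] * len(importance)
    used = sum(layer_memory_bytes(n, low_bits) for n in num_params)
    if used > budget_bytes:
        raise ValueError("budget is too small even at the lowest precision")

    # Indices of layers sorted by descending importance score.
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    for i in order:
        extra = (
            layer_memory_bytes(num_params[i], high_bits)
            - layer_memory_bytes(num_params[i], low_bits)
        )
        if used + extra <= budget_bytes:
            bits[i] = high_bits
            used += extra
    return bits


# Four equally sized layers with made-up importance scores and a 3000-byte budget.
example_bits = allocate_precision([0.9, 0.1, 0.5, 0.3], [1000] * 4, 3000)
```

With these made-up inputs, the two highest-scoring layers are the only ones that fit at 8 bits. A search-based method such as Bayesian optimization would instead evaluate candidate bit-width configurations against a downstream accuracy signal rather than rely on a fixed ranking.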
Abstract
The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Pruning
