Semantic Retention and Extreme Compression in LLMs: Can We Have Both?
Stanislas Laborde, Martin Cousseau, Antoun Yaacoub, Lionel Prevost

TL;DR
This paper explores combining pruning and quantization for LLM compression, introduces the SrCr metric to balance semantic retention against compression, and reports improved performance over single-method approaches.
Contribution
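The paper presents a joint pruning-and-quantization compression framework for LLMs, introduces the Semantic Retention Compression Rate (SrCr) metric to quantify the trade-off between compression and semantic preservation, and reports superior performance-to-compression ratios compared to single-method approaches.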
Findings
20% performance increase over quantization-only models at the same compression rate
Introduction of the SrCr metric for evaluating the trade-off between compression and semantic retention
Joint pruning and quantization outperforms single-method compression approaches
Abstract
The exponential growth in Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques to reduce computational and memory costs. While pruning and quantization have shown promise, their combined potential remains largely unexplored. In this paper, we examine joint compression and how strategically combining pruning and quantization could yield superior performance-to-compression ratios compared to single-method approaches. Recognizing the challenges in accurately assessing LLM performance, we address key limitations of previous evaluation frameworks and introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation, facilitating the optimization of pruning-quantization configurations. Experiments demonstrate that our recommended combination achieves,…
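To make the idea of joint compression concrete, here is a minimal sketch of one way pruning and quantization can be chained on a flat list of weights. It is not the paper's method: the function names (`magnitude_prune`, `quantize_symmetric`, `theoretical_compression_rate`), the choice of magnitude pruning and symmetric uniform quantization, the 16-bit baseline, and the compression-rate formula are all illustrative assumptions. The SrCr metric is not reproduced here because its definition is not given in the text above.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: unstructured magnitude pruning, chosen only for illustration.
    """
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices sorted by increasing magnitude; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_symmetric(weights, bits):
    """Round weights onto a symmetric uniform grid (requires bits >= 2).

    Assumption: one scale for the whole list; real schemes are often
    per-channel or per-group.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Pruned weights stay exactly zero because round(0 / scale) == 0.
    return [round(w / scale) * scale for w in weights]


def theoretical_compression_rate(sparsity, bits, baseline_bits=16):
    """Fraction of storage saved relative to a dense `baseline_bits` model.

    Assumption: ignores sparse-index and quantization-scale overhead.
    """
    return 1.0 - (1.0 - sparsity) * bits / baseline_bits


weights = [0.1, -0.5, 0.03, 0.9, -0.2, 0.7, -0.05, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
compressed = quantize_symmetric(pruned, bits=8)

joint_rate = theoretical_compression_rate(sparsity=0.5, bits=8)
quant_only_rate = theoretical_compression_rate(sparsity=0.0, bits=4)
```

Under these assumptions, 50% sparsity with 8-bit weights and 4-bit quantization alone reach the same theoretical rate, which is the kind of equal-rate comparison the abstract describes between joint and quantization-only configurations.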
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Pruning
