SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe

TL;DR
This paper introduces SPQ, an ensemble compression method combining SVD, pruning, and quantization. On LLaMA-2-7B it cuts memory usage by up to 75% while maintaining or improving perplexity and speeding up inference by up to 1.9x.
Contribution
The paper proposes SPQ, an ensemble compression technique that, at matched compression ratios, outperforms SVD-only, pruning-only, and quantization-only baselines in perplexity, enabling efficient LLM deployment with minimal performance loss.
Findings
Achieves up to 75% memory reduction on LLaMA-2-7B.
Maintains or improves perplexity and downstream task accuracy.
Provides up to 1.9x inference speedup compared to baselines.
Abstract
This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD factorizes attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4,…
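The three components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 0.95 variance threshold, the mean-absolute-activation pruning score, and the symmetric per-tensor quantization scheme are all assumptions chosen for illustration.

```python
import numpy as np

def svd_truncate(W, variance=0.95):
    """Variance-retained SVD (assumed form): keep the smallest rank k whose
    squared singular values account for at least `variance` of the total."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, variance)) + 1, len(s))
    A = U[:, :k] * s[:k]   # shape (out, k), singular values folded in
    B = Vt[:k]             # shape (k, in)
    return A, B            # W is approximated by A @ B

def prune_neurons(W_in, W_out, activations, keep_ratio=0.5):
    """Activation-based pruning (assumed score: mean |activation| per neuron).
    W_in: (hidden, d_model), W_out: (d_model, hidden),
    activations: (n_samples, hidden) collected on calibration data."""
    score = np.abs(activations).mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * score.size)))
    keep = np.sort(np.argsort(score)[-n_keep:])
    return W_in[keep], W_out[:, keep], keep

def quantize_int8(W):
    """Symmetric per-tensor linear quantization to 8 bits (assumed scheme)."""
    max_abs = np.abs(W).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float32) * scale
```

In this sketch, pruning shrinks the MLP hidden dimension, SVD replaces one attention projection with two thinner matrices, and quantization stores the resulting weights as int8 plus a single scale. The order in which the paper applies these steps, and how it sets the rank and pruning ratio per layer, is not stated in the text above.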
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
