SPQ: An Ensemble Technique for Large Language Model Compression

Jiamin Yao; Eren Gultepe

arXiv:2602.18420 · cs.CL · February 23, 2026

TL;DR

This paper introduces SPQ, an ensemble compression method that combines SVD, pruning, and quantization. It reduces the memory usage of LLaMA-2-7B by up to 75% while maintaining or improving perplexity, and reports up to a 1.9x inference speedup.

Contribution

The paper proposes SPQ, an ensemble compression technique that, at matched compression ratios, outperforms SVD-only, pruning-only, and quantization-only compression in perplexity, enabling efficient LLM deployment with minimal performance loss.

Findings

01. Achieves up to 75% memory reduction on LLaMA-2-7B.

02. Maintains or improves perplexity (e.g., WikiText-2 5.47 to 4.91) and preserves downstream task accuracy.

03. Provides up to 1.9x inference speedup compared to baselines.

Abstract

This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD factorizes attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms each individual method (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4,…
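
The three components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 0.99 variance threshold, the 0.5 keep ratio, the ReLU stand-in for the MLP activation, and the symmetric per-tensor quantization scale are all assumptions chosen for illustration.

```python
import numpy as np


def svd_variance_rank(W, retain=0.99):
    """Factor W ~ A @ B, keeping the smallest rank whose squared
    singular values cover at least `retain` of the total (assumed criterion)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, retain)) + 1, len(s))
    A = U[:, :k] * s[:k]   # fold singular values into the left factor
    B = Vt[:k, :]
    return A, B


def prune_mlp_neurons(W_up, W_down, X, keep_ratio=0.5):
    """Drop hidden neurons with the lowest mean absolute activation on
    calibration inputs X. ReLU is a stand-in activation (assumption)."""
    H = np.maximum(X @ W_up.T, 0.0)           # (n_samples, hidden)
    scores = np.abs(H).mean(axis=0)           # one score per hidden neuron
    n_keep = max(1, int(round(keep_ratio * W_up.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return W_up[keep, :], W_down[:, keep]


def quantize_int8(W):
    """Symmetric per-tensor 8-bit linear quantization (assumed scheme)."""
    max_abs = float(np.max(np.abs(W)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale


# Toy stand-ins for one attention projection and one MLP block.
rng = np.random.default_rng(0)
W_attn = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
W_up = rng.standard_normal((32, 16))
W_down = rng.standard_normal((16, 32))
X_calib = rng.standard_normal((100, 16))

A, B = svd_variance_rank(W_attn)                      # SVD on attention
W_up_p, W_down_p = prune_mlp_neurons(W_up, W_down, X_calib)  # prune MLP
quantized = [quantize_int8(M) for M in (A, B, W_up_p, W_down_p)]  # int8 everywhere
```

In this sketch each step shrinks a different part of the model, and quantization is applied last to whatever linear weights remain, which mirrors the division of labor described in the abstract.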

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling