SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Tim Dettmers; Ruslan Svirschevski; Vage Egiazarian; Denis Kuznedelev; Elias Frantar; Saleh Ashkboos; Alexander Borzunov; Torsten Hoefler; Dan Alistarh

arXiv:2306.03078·cs.CL·June 6, 2023·25 cites

TL;DR

SpQR is a compression method for large language models that isolates outlier weights and compresses the remaining weights, achieving near-lossless accuracy and enabling deployment on consumer hardware.

Contribution

Introduces SpQR, a sparse-quantized format and quantization technique that enables near-lossless compression of LLMs by isolating outlier weights and compressing the remaining weights.

Findings

01

Achieves less than 1% relative perplexity loss on LLaMA and Falcon models.

02

Enables running 33B-parameter models on a 24 GB GPU with a 15% speedup.

03

Provides efficient encoding and decoding algorithms for SpQR.
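
The 24 GB finding above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration only: it counts raw weight storage and ignores activations, the KV cache, and any per-group metadata or outlier storage, and the 16-bit baseline and the helper name `weight_memory_gb` are assumptions that do not come from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 33e9    # 33B-parameter model, as in the finding above
GPU_MEMORY_GB = 24   # 24 GB GPU, as in the finding above

# 16 bits is an assumed uncompressed baseline; 3 and 4 bits are the
# compression range quoted in the abstract.
for bits in (16, 4, 3):
    size = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{bits:>2} bits/param: {size:.2f} GB -> {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions the 16-bit weights alone exceed 24 GB, while 3 to 4 bits per parameter leaves headroom for everything the estimate leaves out.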

Abstract

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables for the first time near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly-large quantization errors, and…
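
The idea of isolating outlier weights before quantizing can be illustrated with a toy example. This is a minimal sketch, not the authors' algorithm: the abstract as shown here is truncated, so the outlier criterion (largest magnitude), the uniform min-max quantizer, and the names `quantize_with_outliers` and `dequantize` are all assumptions chosen for illustration.

```python
def quantize_with_outliers(weights, bits=3, outlier_fraction=0.1):
    """Split weights into dense low-bit codes plus sparse full-precision outliers.

    Assumption: outliers are picked by largest magnitude, a simple
    stand-in for whatever criterion the paper actually uses.
    """
    n = len(weights)
    num_outliers = int(n * outlier_fraction)
    # The largest-magnitude weights are stored separately at full precision.
    by_magnitude = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    outliers = {i: weights[i] for i in by_magnitude[:num_outliers]}

    # Uniform min-max quantization over the remaining weights only.
    inliers = [w for i, w in enumerate(weights) if i not in outliers]
    lo, hi = min(inliers), max(inliers)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [0 if i in outliers else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return codes, scale, lo, outliers


def dequantize(codes, scale, lo, outliers):
    """Rebuild weights: dequantized dense values, overridden by stored outliers."""
    return [outliers.get(i, lo + c * scale) for i, c in enumerate(codes)]


def max_abs_error(a, b):
    """Largest absolute element-wise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy weight vector with a single large outlier at index 6.
weights = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 4.00, -0.03, 0.02, -0.01]

plain = dequantize(*quantize_with_outliers(weights, outlier_fraction=0.0))
split = dequantize(*quantize_with_outliers(weights, outlier_fraction=0.1))

print("max error, no outlier isolation: ", max_abs_error(weights, plain))
print("max error, 10% outliers isolated:", max_abs_error(weights, split))
```

In this toy setup the single large weight stretches the quantization range when it is left in the dense part, so the small weights collapse onto the same level. Storing it separately lets the 3-bit grid cover only the small weights, which lowers the reconstruction error.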

Code & Models

Repositories

vahe1994/spqr · PyTorch · Official

Videos

The AI News You Might Have Missed This Week - Zuckerberg to Falcon w/ SPQR · YouTube

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression · SlidesLive

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Ferroelectric and Negative Capacitance Devices