SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

TL;DR
SqueezeLLM is a post-training quantization framework that compresses large language models to ultra-low precision (down to 3 bits), easing the memory-bandwidth bottleneck and accelerating inference without performance loss.
Contribution
The paper presents a new quantization method combining sensitivity-based non-uniform quantization and Dense-and-Sparse decomposition, achieving lossless compression and improved inference speed for LLMs.
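The two ideas named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the per-row lookup table, the quantile-based outlier threshold, and the `sensitivity` array (here just any non-negative weighting supplied by the caller) are all assumptions made for illustration. The sketch splits a weight matrix into a dense part and a sparse outlier part, then fits a small non-uniform codebook to each dense row with a weighted 1-D k-means.

```python
import numpy as np


def dense_sparse_split(W, outlier_frac=0.005):
    """Split W into a dense part (outliers zeroed) and a sparse part (outliers only)."""
    # Assumption: outliers are the largest-magnitude entries, picked by a quantile threshold.
    thresh = np.quantile(np.abs(W), 1.0 - outlier_frac)
    mask = np.abs(W) > thresh
    sparse = np.where(mask, W, 0.0)  # would be stored in a sparse format at full precision
    dense = np.where(mask, 0.0, W)   # narrower value range, easier to quantize
    return dense, sparse


def weighted_kmeans_1d(values, weights, k, iters=20):
    """1-D k-means in which each value pulls its centroid in proportion to its weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            sel = assign == j
            total = weights[sel].sum()
            if total > 0:  # leave empty or zero-weight clusters where they are
                centroids[j] = (weights[sel] * values[sel]).sum() / total
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign


def quantize_dense_and_sparse(W, sensitivity, bits=3, outlier_frac=0.005):
    """Return (quantized dense part, full-precision sparse part)."""
    dense, sparse = dense_sparse_split(W, outlier_frac)
    dense_q = np.empty_like(dense)
    for r in range(dense.shape[0]):  # assumption: one 2**bits-entry lookup table per row
        centroids, assign = weighted_kmeans_1d(dense[r], sensitivity[r], 2 ** bits)
        dense_q[r] = centroids[assign]
    return dense_q, sparse
```

In this sketch the reconstructed weights would be `dense_q + sparse`. Each row of `dense_q` holds at most `2 ** bits` distinct values, so it could be stored as low-bit indices plus a small lookup table, while the few outliers stay at full precision.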
Findings
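3-bit quantization reduces the perplexity gap from the FP16 baseline by up to 2.1x compared with state-of-the-art methods at the same memory requirement
Quantized models achieve up to 2.3x speedup on GPU compared to the baseline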
Outperforms state-of-the-art quantization methods
Abstract
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of…
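The memory-bandwidth argument in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that, at batch size 1, every weight is read from memory once per generated token, so weight bytes divided by memory bandwidth gives a lower bound on per-token latency. The parameter count (7e9) and bandwidth (768 GB/s) are hypothetical inputs chosen for illustration, not figures taken from the paper.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight / 8


def min_token_latency_s(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if weight loading were the only cost."""
    return weight_bytes(n_params, bits_per_weight) / bandwidth_bytes_per_s


if __name__ == "__main__":
    n_params = 7e9       # hypothetical model size
    bandwidth = 768e9    # hypothetical memory bandwidth in bytes per second
    for bits in (16, 4, 3):
        ms = 1e3 * min_token_latency_s(n_params, bits, bandwidth)
        print(f"{bits:>2}-bit weights: at least {ms:.2f} ms per token")
```

Under these assumptions the bound scales linearly with bits per weight, so moving from 16-bit to 3-bit weights lowers it by a factor of 16/3. This is an idealized ceiling: it ignores lookup-table and sparse-matrix overheads, activations, and the key-value cache, so measured speedups will be smaller.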
Code & Models
- 🤗squeeze-ai-lab/sq-llama-7b-w3-s0model· ♡ 2♡ 2
- 🤗squeeze-ai-lab/sq-llama-7b-w4-s0model· ♡ 1♡ 1
- 🤗squeeze-ai-lab/sq-llama-13b-w3-s0model· ♡ 1♡ 1
- 🤗squeeze-ai-lab/sq-llama-13b-w4-s0model· ♡ 1♡ 1
- 🤗squeeze-ai-lab/sq-vicuna-13b-w3-s0model· ♡ 2♡ 2
- 🤗squeeze-ai-lab/sq-vicuna-7b-w4-s0model· ♡ 1♡ 1
- 🤗squeeze-ai-lab/sq-vicuna-7b-w3-s0model· ♡ 2♡ 2
- 🤗squeeze-ai-lab/sq-vicuna-13b-w4-s0model· ♡ 2♡ 2
- 🤗squeeze-ai-lab/sq-llama-30b-w3-s0model· ♡ 4♡ 4
- 🤗squeeze-ai-lab/sq-llama-30b-w4-s0model· ♡ 1♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
