SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim; Coleman Hooper; Amir Gholami; Zhen Dong; Xiuyu Li; Sheng Shen; Michael W. Mahoney; Kurt Keutzer

arXiv:2306.07629 · cs.CL · June 6, 2024 · 23 cites

TL;DR

SqueezeLLM is a post-training quantization framework that compresses large language models to ultra-low precision (down to 3-bit), easing the memory-bandwidth bottleneck and accelerating inference while preserving model quality.

Contribution

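The paper presents a quantization method that combines two ideas: sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information, and Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format. Together they achieve lossless compression and faster inference for LLMs.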
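The two ideas named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `outlier_frac` value, the quantile initialization, and the `sensitivity` array (a stand-in for a per-weight second-order importance score) are all assumptions made for illustration, and the clustering here runs over a whole matrix at once purely for brevity.

```python
import numpy as np

def dense_sparse_split(W, outlier_frac=0.005):
    """Split W into a dense part (outliers zeroed) and a sparse part (outliers only).

    outlier_frac is a hypothetical setting, not a value taken from the paper.
    """
    k = max(1, int(round(outlier_frac * W.size)))
    # Magnitude of the k-th largest entry serves as the outlier threshold.
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    sparse = np.where(mask, W, 0.0)   # kept in full precision
    dense = np.where(mask, 0.0, W)    # to be quantized
    return dense, sparse, mask

def weighted_kmeans_1d(values, weights, n_clusters, n_iter=25):
    """1-D weighted k-means (Lloyd iterations); centroids form the non-uniform grid."""
    # Assumed initialization: evenly spaced quantiles of the values.
    centroids = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            sel = assign == c
            wsum = weights[sel].sum()
            if wsum > 0:
                # Sensitive weights pull the centroid toward themselves.
                centroids[c] = np.sum(weights[sel] * values[sel]) / wsum
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

def squeeze_sketch(W, sensitivity, bits=3, outlier_frac=0.005):
    """Quantize non-outlier weights to 2**bits centroids; keep outliers exact."""
    dense, sparse, mask = dense_sparse_split(W, outlier_frac)
    keep = ~mask
    centroids, assign = weighted_kmeans_1d(dense[keep], sensitivity[keep], 2 ** bits)
    W_hat = sparse.copy()
    W_hat[keep] = centroids[assign]
    return W_hat, centroids

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
sensitivity = rng.random(size=W.shape)  # hypothetical importance scores
W_hat, centroids = squeeze_sketch(W, sensitivity, bits=3)
```

In this sketch the dense part needs only a 3-bit index per weight plus a small centroid table, while the few outliers sit in a separate sparse matrix; removing them first narrows the value range the centroids have to cover.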

Findings

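01 3-bit quantization reduces the perplexity gap from the FP16 baseline by up to 2.1x compared with state-of-the-art methods at the same memory requirement

02 Quantized models achieve up to 2.3x speedup on GPU compared with the baseline

03 Outperforms state-of-the-art quantization methods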

Abstract

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of…
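The memory-bandwidth argument can be made concrete with a back-of-envelope estimate. In single-batch generation every weight has to be read from memory for each generated token, so the time to stream the weights gives a rough lower bound on per-token latency. The sketch below assumes a hypothetical 7-billion-parameter model and a hypothetical 768 GB/s of memory bandwidth; neither figure comes from this page, and the function name is illustrative.

```python
def weight_load_ms(n_params, bits_per_weight, bandwidth_gb_s):
    """Milliseconds needed to stream all weights from memory once."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / (bandwidth_gb_s * 1e9) * 1e3

n_params = 7e9       # hypothetical model size
bandwidth = 768.0    # hypothetical memory bandwidth in GB/s

for bits in (16, 4, 3):
    ms = weight_load_ms(n_params, bits, bandwidth)
    print(f"{bits:>2}-bit weights: {ms:.2f} ms per token (weight traffic only)")
```

This is an idealized bound: it ignores activations, the key-value cache, dequantization work, and kernel overheads, so measured speedups are smaller than the raw 16/3 ratio in weight traffic would suggest.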

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis