SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao; Ji Lin; Mickael Seznec; Hao Wu; Julien Demouth; Song Han

arXiv:2211.10438·cs.CL·April 3, 2024·97 cites

TL;DR

SmoothQuant is a post-training quantization method that enables efficient 8-bit quantization of large language models without significant accuracy loss, reducing hardware costs and enabling larger models to run on limited hardware.

Contribution

The paper introduces a mathematically equivalent transformation that shifts quantization difficulty from activations to weights, allowing accurate 8-bit weight, 8-bit activation (W8A8) quantization of LLMs without retraining.
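As a rough illustration of how quantization difficulty can move from activations to weights, the NumPy sketch below divides each activation input channel by a scale and multiplies the matching weight row by the same scale, so the product is unchanged. The function name `smooth`, the balancing exponent `alpha=0.5`, the `eps` floor, and the toy shapes are assumptions made for this example; it is not the authors' implementation.

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-5):
    """Rescale activations and weights per input channel.

    X: (tokens, in_features) activations
    W: (in_features, out_features) weights
    alpha: hypothetical balancing exponent (0.5 splits the range evenly)
    """
    act_max = np.abs(X).max(axis=0)   # largest magnitude per activation channel
    w_max = np.abs(W).max(axis=1)     # largest magnitude per matching weight row
    s = np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1 - alpha)
    X_s = X / s                       # activation outliers shrink
    W_s = W * s[:, None]              # weights absorb the scale
    return X_s, W_s, s

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 3] *= 50.0                       # simulate one outlier activation channel
W = rng.normal(size=(8, 4))

X_s, W_s, s = smooth(X, W)

# The transformation is mathematically equivalent: the output is unchanged.
print(np.allclose(X @ W, X_s @ W_s))
# The activation range is narrower after smoothing.
print(np.abs(X).max(), np.abs(X_s).max())
```

Because the scale is computed ahead of time from calibration statistics and folded into the weights, a transformation of this kind adds no work at inference time.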

Findings

01. Up to 1.56x speedup in inference
02. 2x reduction in memory usage
03. Effective on multiple large language models
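The 2x memory figure is consistent with simple storage arithmetic if the baseline stores weights in 16-bit floating point and the quantized model stores them in 8-bit integers. The sketch below assumes that baseline and uses a hypothetical 175-billion-parameter model; it counts weight storage only and ignores activations and caches.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

num_params = 175e9                          # hypothetical model size
fp16_gb = weight_memory_gb(num_params, 2)   # assumed FP16 baseline: 2 bytes per weight
int8_gb = weight_memory_gb(num_params, 1)   # INT8: 1 byte per weight

print(fp16_gb, int8_gb, fp16_gb / int8_gb)
```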

Abstract

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We…
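To make the W8A8 setting concrete, here is a minimal NumPy sketch of a matrix multiplication in which both operands are quantized to INT8, accumulated in 32-bit integers, and rescaled to floating point. Symmetric per-tensor scaling, the helper names `quantize_int8` and `w8a8_matmul`, and the random test data are assumptions for illustration; production kernels and the paper's own quantization granularity may differ.

```python
import numpy as np

def quantize_int8(t):
    """Symmetric per-tensor quantization to INT8 (assumed scheme)."""
    scale = max(np.abs(t).max(), 1e-8) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(X, W):
    """Quantize both operands, multiply in integers, then rescale."""
    qx, sx = quantize_int8(X)
    qw, sw = quantize_int8(W)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)   # integer accumulation
    return acc * (sx * sw)                            # back to floating point

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))    # activations without outliers
W = rng.normal(size=(64, 32))     # weights

qx, sx = quantize_int8(X)
Y_fp = X @ W
Y_q = w8a8_matmul(X, W)
rel_err = np.linalg.norm(Y_q - Y_fp) / np.linalg.norm(Y_fp)
print(rel_err)
```

With a per-tensor scale, a single large activation outlier stretches the quantization step for every other value, which is the difficulty the abstract says smoothing is meant to relieve.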

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: OPT · BLOOM · GLM · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings