SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

TL;DR
SmoothQuant is a post-training quantization method that enables efficient 8-bit quantization of large language models without significant accuracy loss, reducing hardware costs and enabling larger models to run on limited hardware.
Contribution
It introduces a novel transformation that shifts quantization difficulty from activations to weights, allowing accurate 8-bit quantization of LLMs without retraining.
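The transformation can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the per-channel smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with migration strength alpha = 0.5. The function names (`matmul`, `smooth`, `quantize_int8`) are hypothetical, and the per-tensor symmetric fake-quantizer is only one simple choice of INT8 scheme.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Rescale so that matmul(x_smooth, w_smooth) equals matmul(x, w).

    x: activations, shape [tokens, in_channels]
    w: weights, shape [in_channels, out_channels]
    Assumes every input channel has a nonzero activation and weight maximum.
    """
    n_in = len(w)
    # Per-input-channel absolute maxima of activations and weights.
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    # Smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    # Divide activation channel j by s_j, multiply weight row j by s_j.
    x_smooth = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_smooth, w_smooth, s


def quantize_int8(m):
    """Symmetric per-tensor INT8 fake-quantization (quantize, then dequantize).

    Assumes the matrix has at least one nonzero entry.
    """
    scale = max(abs(v) for row in m for v in row) / 127
    return [[round(v / scale) * scale for v in row] for row in m]


# Toy input: activation channel 1 carries outliers, channel 0 does not.
x = [[0.1, 60.0], [-0.2, -80.0], [0.3, 40.0]]
w = [[0.5, -0.4], [0.02, 0.03]]

x_s, w_s, s = smooth(x, w)
y_ref = matmul(x, w)                                   # full-precision reference
y_w8a8 = matmul(quantize_int8(x_s), quantize_int8(w_s))  # smoothed W8A8 path
```

Because the scaling on the activations is undone exactly by the scaling on the weights, `matmul(x_s, w_s)` matches `y_ref` up to floating-point rounding. With alpha = 0.5, each channel's activation maximum and weight maximum both become sqrt(max|X_j| * max|W_j|), so the outlier channel no longer dominates the activation range.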
Findings
Up to 1.56x speedup in inference
2x reduction in memory usage
Effective on multiple large language models
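The 2x memory figure is consistent with simple storage arithmetic, assuming the baseline stores weights in 16-bit floating point (the listing above does not state the baseline). The sketch below uses a hypothetical helper, `weight_memory_gb`, and a hypothetical 7-billion-parameter model; it counts weight storage only and ignores activations, the KV cache, and quantization scales.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size, for illustration only

fp16_gb = weight_memory_gb(num_params, 16)  # assumed 16-bit baseline
int8_gb = weight_memory_gb(num_params, 8)   # 8-bit quantized weights
reduction = fp16_gb / int8_gb
```

Halving the bits per weight halves the weight footprint regardless of model size, so `reduction` is 2 for any parameter count.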
Abstract
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We…
Code & Models
- 🤗 RedHatAI/Phi-3-medium-128k-instruct-quantized.w8a8 · model · 23 dl · ♡ 2
- 🤗 RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8 · model · 13k dl · ♡ 7
- 🤗 RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8 · model · 311 dl · ♡ 1
- 🤗 RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8 · model · 929 dl · ♡ 2
- 🤗 RedHatAI/phi-4-quantized.w8a8 · model · 456 dl · ♡ 3
- 🤗 RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8 · model · 20k dl · ♡ 1
- 🤗 RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 · model · 215 dl · ♡ 5
- 🤗 neuralmagic/Llama-3.2-3B-Instruct-quantized.w8a8 · model · 316 dl
- 🤗 ArslanRobo/llama-3.1-8b-instruct-smoothquant-Pruned30Tailor-fp16 · model · 1 dl
- 🤗 RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 · model · 108 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: OPT · BLOOM · GLM · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
