GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

TL;DR
GPTQ is a one-shot, post-training weight quantization method that compresses large GPT models to 3-4 bits per weight, enabling efficient inference with negligible accuracy loss.
Contribution
The paper introduces GPTQ, a one-shot weight quantization technique based on approximate second-order information. It compresses GPT models with up to 175 billion parameters to 3-4 bits per weight, which makes single-GPU inference possible.
Findings
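- Quantizes 175B-parameter GPT models in ~4 GPU hours
- Reduces bitwidth to 3-4 bits per weight with negligible accuracy loss
- Achieves end-to-end inference speedups over FP16 of about 3.25x on high-end GPUs (NVIDIA A100) and about 4.5x on more cost-effective ones (NVIDIA A6000)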
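To put the bitwidth figures above in perspective, the following back-of-the-envelope sketch computes the storage needed for the weights alone at 16, 4, and 3 bits per weight. It is illustrative arithmetic only: the helper name `weight_memory_gb` is hypothetical, and the calculation ignores activations, the KV cache, and any per-group quantization metadata such as scales and zero points.

```python
# Weight-only memory footprint at different bit widths (illustrative arithmetic).
# Assumption: every parameter is stored at the same bit width, with no overhead.

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 175e9  # parameter count quoted in the findings

for bits in (16, 4, 3):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from 350 GB at 16 bits to 87.5 GB at 4 bits and roughly 65.6 GB at 3 bits, which is why low-bit quantization changes how many GPUs a model of this size needs.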
Abstract
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU…
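For orientation, here is a minimal sketch of what low-bit weight quantization means in practice. It shows plain round-to-nearest quantization on a per-row min-max grid, which is a simpler technique than GPTQ: the second-order error compensation described in the abstract is not implemented here. The function names, the per-row grid, and the sample values are all assumptions made for illustration.

```python
# Round-to-nearest (RTN) uniform quantization of one weight row.
# NOTE: this is a simple baseline, NOT GPTQ's second-order procedure.
from typing import List, Tuple


def quantize_row(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1] on a min-max grid."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant row: any positive scale works, all codes are 0
    # Round each weight to the nearest grid point and clamp to the valid range.
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_row(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Reconstruct approximate float weights from integer codes."""
    return [w_min + c * scale for c in codes]


row = [-0.8, -0.1, 0.0, 0.25, 0.7]  # made-up example weights
codes, scale, w_min = quantize_row(row, bits=4)
restored = dequantize_row(codes, scale, w_min)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(codes, f"max abs error = {max_err:.4f}")
```

With a grid of this kind, the reconstruction error of each weight is bounded by half the grid spacing (`scale / 2`). Rounding each weight independently, as done here, ignores how errors interact across weights; that interaction is what a second-order method aims to account for.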
Code & Models
- 🤗 BelleGroup/BELLE-7B-gptq · model · 7 dl · ♡ 26
- 🤗 sardukar/llama13b-4bit-v2 · model · 3 dl · ♡ 5
- 🤗 sardukar/llama7b-4bit-v2 · model · 6 dl · ♡ 3
- 🤗 BelleGroup/BELLE-LLaMA-7B-2M-gptq-enc · model · ♡ 2
- 🤗 BelleGroup/BELLE_BLOOM_GPTQ_4BIT · model · 3 dl · ♡ 3
- 🤗 Thireus/Vicuna13B-v1.1-8bit-128g · model · 8 dl · ♡ 16
- 🤗 mayank-mishra/starcoderbase-GPTQ-4bit-128g · model · ♡ 21
- 🤗 mayank-mishra/starcoderbase-GPTQ-8bit-128g · model · ♡ 3
- 🤗 mayank-mishra/santacoder-GPTQ-8bit-128g · model · ♡ 1
- 🤗 mayank-mishra/santacoder-GPTQ-4bit-128g · model · ♡ 2
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · OPT · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections
