GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar; Saleh Ashkboos; Torsten Hoefler; Dan Alistarh

arXiv:2210.17323·cs.LG·March 23, 2023·130 cites



TL;DR

GPTQ is a one-shot post-training weight quantization method that compresses large GPT models to 3-4 bits per weight, enabling efficient inference with negligible accuracy loss.

Contribution

The paper introduces GPTQ, an accurate and efficient one-shot weight quantization technique based on approximate second-order information. It compresses GPT models with up to 175 billion parameters to 3-4 bits per weight, which makes single-GPU inference possible.

Findings

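01. Quantizes 175B-parameter GPT models in ~4 GPU hours

02. Reduces bitwidth to 3-4 bits per weight with negligible accuracy loss

03. Achieves end-to-end inference speedups over FP16 of about 3.25x on high-end GPUs (NVIDIA A100) and 4.5x on more cost-effective ones (NVIDIA A6000)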
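The link between bitwidth and single-GPU inference comes down to simple storage arithmetic. The sketch below is only a rough estimate: it counts weight storage alone and assumes a 16-bit baseline, ignoring quantization metadata (such as scales and zero points), activations, and any runtime overhead. The function name `weight_storage_gb` is hypothetical and chosen for illustration.

```python
def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Counts only the raw weight bits; scales, zero points, activations,
    and other runtime buffers are deliberately left out.
    """
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175e9  # parameter count quoted in the findings above

# Compare an assumed 16-bit baseline with the 4-bit and 3-bit settings.
for bits in (16, 4, 3):
    size = weight_storage_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

Comparing these figures with the memory of the target GPU gives a first indication of whether the weights can fit on one device at a given bitwidth.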

Abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU…
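To make "weight quantization" concrete, here is a minimal sketch of plain round-to-nearest quantization onto a uniform min-max grid. This is not GPTQ: it omits the approximate second-order information the abstract describes and only illustrates what storing a weight in a few bits can mean. The function names, the per-row grid, and the example values are assumptions made for illustration.

```python
def quantize_rtn(weights, bits):
    """Round-to-nearest quantization of a list of floats (baseline, not GPTQ).

    Returns (codes, scale, w_min), where each code is an integer
    in the range [0, 2**bits - 1].
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant row: any positive scale works, all codes become 0
    # Map each weight to the nearest grid point and clamp to the valid range.
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


row = [-0.62, -0.10, 0.03, 0.27, 0.91]  # made-up example weights
codes, scale, w_min = quantize_rtn(row, bits=4)
restored = dequantize(codes, scale, w_min)

print("codes:   ", codes)
print("restored:", [round(w, 3) for w in restored])
print("max abs error:", max(abs(a - b) for a, b in zip(row, restored)))
```

With this grid, the reconstruction error of each weight is bounded by half the step size `scale`. Rounding every weight independently like this is the simple baseline that more accurate methods aim to improve on at low bitwidths.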


Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · OPT · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections