The case for 4-bit precision: k-bit Inference Scaling Laws

Tim Dettmers; Luke Zettlemoyer

arXiv:2212.09720 · cs.LG · March 1, 2023 · 22 cites

TL;DR

This paper develops inference scaling laws for large language models to identify the optimal bit-precision, finding that 4-bit quantization is nearly universally optimal for balancing model size and zero-shot performance.

Contribution

The study provides extensive empirical analysis of quantization effects on LLMs, establishing that 4-bit precision offers near-optimal trade-offs across various model sizes and architectures.

Findings

01. 4-bit precision is nearly optimal for zero-shot performance and model size.

02. Small block size and quantization data type influence scaling improvements.

03. Scaling laws guide optimal bit-precision choices for LLM inference.
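The block-size finding can be illustrated with a minimal sketch. It assumes symmetric absmax integer quantization with one 16-bit scale stored per block; the function names are hypothetical, and the paper's own data types and implementation may differ. The sketch shows two effects of shrinking the block: an outlier distorts fewer neighboring values, and the per-parameter storage cost rises slightly.

```python
def quantize_blockwise(values, bits, block_size):
    """Round-trip values through symmetric absmax integer quantization.

    Each block of `block_size` values gets its own scale, so an outlier
    only coarsens the grid of the block it sits in.
    """
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4-bit
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        if absmax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        scale = absmax / levels
        # Quantize to the nearest integer level, then dequantize.
        out.extend(round(v / scale) * scale for v in block)
    return out


def mean_abs_error(original, restored):
    """Average absolute round-trip error."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


def effective_bits(bits, block_size, scale_bits=16):
    """Bits per parameter once the per-block scale is counted (assumed 16-bit)."""
    return bits + scale_bits / block_size


# Toy weights: 63 small values plus one large outlier at the end.
weights = [0.01 * i for i in range(1, 64)] + [50.0]

err_large = mean_abs_error(weights, quantize_blockwise(weights, 4, 64))
err_small = mean_abs_error(weights, quantize_blockwise(weights, 4, 16))

print(err_small < err_large)   # True: the outlier now affects one block of 16
print(effective_bits(4, 64))   # 4.25
```

In this toy case the smaller block lowers the round-trip error because the outlier sets the scale for 16 values instead of 64, while the extra scales add a fraction of a bit per parameter.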

Abstract

Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2.…
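The abstract's 30B versus 60B comparison is simple arithmetic, sketched below. The helper names are hypothetical, and the count covers parameter storage only, ignoring any quantization metadata.

```python
def total_model_bits(num_params: int, bits_per_param: int) -> int:
    """Total bits needed to store the parameters alone."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: int) -> float:
    """Convert bits to decimal gigabytes (1 GB = 10**9 bytes)."""
    return bits / 8 / 1e9


# The two configurations named in the abstract.
bits_30b_8bit = total_model_bits(30 * 10**9, 8)
bits_60b_4bit = total_model_bits(60 * 10**9, 4)

print(bits_30b_8bit == bits_60b_4bit)    # True: both are 240 billion bits
print(bits_to_gigabytes(bits_30b_8bit))  # 30.0
```

Because both models occupy the same memory, the question the paper studies is which of the two reaches higher zero-shot accuracy for that fixed bit budget.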

Code & Models

Repositories

qwopqwop200/GPTQ-for-LLaMa
pytorch

Models

Thireus/Vicuna13B-v1.1-8bit-128g
model · 8 dl · ♡ 16

Videos

The case for 4-bit precision: k-bit Inference Scaling Laws · slideslive

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning and Algorithms

Methods: Multi-Head Attention · Attention Is All You Need · BLOOM · OPT · Linear Layer · Byte Pair Encoding · Dense Connections · Attention Dropout · Residual Connection