LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers; Mike Lewis; Younes Belkada; Luke Zettlemoyer

arXiv:2208.07339·cs.LG·November 11, 2022·112 cites

TL;DR

This paper introduces LLM.int8(), an 8-bit matrix multiplication method that halves the GPU memory needed for large language model inference without degrading performance, making models of up to 175B parameters easier to deploy.

Contribution

We develop a two-part quantization procedure for transformer models that maintains full performance at 8-bit precision: vector-wise quantization for most features, plus a mixed-precision decomposition that handles emergent outlier features.
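To make the mixed-precision decomposition concrete, here is a minimal pure-Python sketch of the idea, not the authors' implementation. It assumes that an outlier is any feature dimension whose activation magnitude reaches a cutoff; the `threshold=6.0` default, the helper names, and the single absmax scale used on the non-outlier part are all illustrative assumptions. Python floats stand in for the higher-precision path.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def quantize_absmax(m):
    """Round a matrix to the int8 range [-127, 127] using one absmax scale."""
    scale = max(abs(v) for row in m for v in row) / 127.0 or 1.0
    return [[round(v / scale) for v in row] for row in m], scale


def mixed_precision_matmul(x, w, threshold=6.0):
    """Approximate x @ w, keeping outlier feature dimensions out of the int8 path.

    threshold is an assumed cutoff chosen for illustration.
    """
    n_features = len(w)
    # A feature dimension (a column of x, a row of w) counts as an outlier
    # when any activation in it reaches the threshold in magnitude.
    outliers = [j for j in range(n_features)
                if any(abs(row[j]) >= threshold for row in x)]
    regular = [j for j in range(n_features) if j not in outliers]

    out = [[0.0] * len(w[0]) for _ in x]

    if regular:
        # Int8 path: quantize, accumulate in integers, then rescale.
        xq, sx = quantize_absmax([[row[j] for j in regular] for row in x])
        wq, sw = quantize_absmax([w[j] for j in regular])
        for i, acc_row in enumerate(matmul(xq, wq)):
            for c, acc in enumerate(acc_row):
                out[i][c] += acc * sx * sw

    if outliers:
        # Higher-precision path: outlier dimensions are multiplied unquantized.
        x_hi = [[row[j] for j in outliers] for row in x]
        w_hi = [w[j] for j in outliers]
        for i, hi_row in enumerate(matmul(x_hi, w_hi)):
            for c, v in enumerate(hi_row):
                out[i][c] += v

    return out
```

Calling `mixed_precision_matmul(x, w)` on an activation matrix with one very large column sends that column through the unquantized path, so the large values no longer stretch the int8 scale used for the remaining columns.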

Findings

01

Enables inference of 175B parameter models in 8-bit without performance loss.

02

Reduces memory requirements by half for large language models.

03

Open-sources the software for broader adoption.
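The halving of memory in the findings above follows from storage arithmetic. The sketch below is a back-of-the-envelope estimate that counts weights only and assumes every parameter is stored at the stated bit width; it ignores activations, normalization constants, and any values kept in higher precision. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_175b = 175e9
fp16_gb = weight_memory_gb(params_175b, 16)  # 2 bytes per parameter
int8_gb = weight_memory_gb(params_175b, 8)   # 1 byte per parameter

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"8-bit weights:  {int8_gb:.0f} GB")
```

Under these assumptions the 8-bit figure is exactly half the 16-bit figure, and a quarter of what a 32-bit checkpoint would need.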

Abstract

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the…
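The vector-wise quantization step described in the abstract can be sketched as follows. This is a minimal pure-Python illustration rather than the paper's code: it assumes absmax scaling to the range [-127, 127], with one normalization constant per row of the left matrix and one per column of the right matrix, so each inner product is rescaled by its own pair of constants. The function name is hypothetical.

```python
def vectorwise_int8_matmul(x, w):
    """Approximate x @ w with per-row (x) and per-column (w) absmax scales."""
    cols = list(zip(*w))  # columns of w as tuples

    # One normalization constant per row of x and per column of w.
    row_scales = [max(abs(v) for v in row) / 127.0 or 1.0 for row in x]
    col_scales = [max(abs(v) for v in col) / 127.0 or 1.0 for col in cols]

    # Quantize each vector with its own scale.
    xq = [[round(v / s) for v in row] for row, s in zip(x, row_scales)]
    wq = [[round(v / s) for v in col] for col, s in zip(cols, col_scales)]

    out = []
    for q_row, sx in zip(xq, row_scales):
        out_row = []
        for q_col, sw in zip(wq, col_scales):
            acc = sum(a * b for a, b in zip(q_row, q_col))  # integer accumulation
            out_row.append(acc * sx * sw)  # dequantize with this pair of scales
        out.append(out_row)
    return out
```

Because each row and column carries its own constant, a row of small activations is not forced onto the coarse grid that a row of large activations would require under a single tensor-wide scale.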

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Topic Modeling