LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

TL;DR
This paper introduces LLM.int8(), an 8-bit matrix multiplication method that halves the memory needed for large language model inference without degrading performance, making large models easier to deploy.
Contribution
We develop a two-part quantization procedure for transformer models that maintains full performance at 8-bit precision: vector-wise quantization for most features, plus a mixed-precision decomposition that handles emergent outliers.
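As a rough illustration of the mixed-precision idea, the sketch below splits the feature dimensions of an input matrix into outlier dimensions and regular dimensions, multiplies the outlier part in floating point, and multiplies the rest with absmax int8 quantization. This is a minimal pure-Python sketch, not the authors' implementation: the function names, the list-of-lists matrix layout, and the `threshold=6.0` default are all assumptions made for this example, and plain Python floats stand in for the 16-bit path.

```python
def split_outlier_columns(X, threshold=6.0):
    """Split feature indices of X into (outlier, regular).

    A feature dimension counts as an outlier if any row has |value| >= threshold.
    The threshold default is an assumption of this sketch.
    """
    k = len(X[0])
    outlier = [j for j in range(k) if any(abs(row[j]) >= threshold for row in X)]
    regular = [j for j in range(k) if j not in outlier]
    return outlier, regular


def float_matmul_on(X, W, idx):
    """Plain float matmul restricted to the feature dimensions in idx."""
    n_cols = len(W[0])
    return [[sum(row[j] * W[j][c] for j in idx) for c in range(n_cols)] for row in X]


def int8_matmul_on(X, W, idx):
    """Absmax int8 matmul restricted to idx (one scale per X row and per W column)."""
    def quantize(vec):
        scale = (max((abs(v) for v in vec), default=0.0) / 127.0) or 1.0
        return [round(v / scale) for v in vec], scale

    rows = [quantize([row[j] for j in idx]) for row in X]
    cols = [quantize([W[j][c] for j in idx]) for c in range(len(W[0]))]
    # Integer accumulation, then dequantize with the product of the two scales.
    return [[sum(a * b for a, b in zip(xq, wq)) * sx * sw for wq, sw in cols]
            for xq, sx in rows]


def mixed_precision_matmul(X, W, threshold=6.0):
    """Outlier dimensions in float, everything else in int8, then sum the parts."""
    outlier, regular = split_outlier_columns(X, threshold)
    high = float_matmul_on(X, W, outlier)
    low = int8_matmul_on(X, W, regular)
    return [[h + l for h, l in zip(hr, lr)] for hr, lr in zip(high, low)]


def max_abs_error(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Toy input: feature dimension 1 carries large-magnitude outliers.
X = [[0.5, -40.0, 0.2], [-0.8, 35.0, 1.0]]
W = [[0.2, -0.4], [0.1, 0.3], [-0.6, 0.5]]

exact = float_matmul_on(X, W, range(len(W)))
mixed = mixed_precision_matmul(X, W)
all_int8 = mixed_precision_matmul(X, W, threshold=float("inf"))

print("mixed-precision error:", max_abs_error(mixed, exact))
print("all-int8 error:       ", max_abs_error(all_int8, exact))
```

On this toy input, routing the outlier dimension through the floating-point path keeps it from inflating the quantization scale of the remaining features, which is the intuition behind the decomposition.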
Findings
Enables inference of 175B-parameter models in 8-bit without performance loss.
Halves the memory required for large language model inference.
Open-sources the software for broader adoption.
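To make the memory claim concrete, here is a back-of-the-envelope estimate for the weights of a 175B-parameter model. It is a sketch under stated assumptions: the baseline is a 16-bit checkpoint (2 bytes per parameter), activations and any values kept in higher precision are ignored, and `weight_memory_gb` is a hypothetical helper written for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 175e9  # 175B parameters

fp16_gb = weight_memory_gb(params, 2)  # 16-bit weights: 2 bytes each
int8_gb = weight_memory_gb(params, 1)  # 8-bit weights: 1 byte each

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"8-bit weights:  {int8_gb:.0f} GB")
print(f"ratio:          {int8_gb / fp16_gb:.2f}")
```

Against a 32-bit checkpoint (4 bytes per parameter) the same arithmetic gives a fourfold reduction in weight memory.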
Abstract
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the…
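The vector-wise step mentioned in the abstract can be sketched as follows: each row of the input and each column of the weight matrix gets its own normalization constant, so every inner product in the matrix multiplication is rescaled by its own pair of constants. This is a minimal pure-Python sketch, not the released software; it assumes symmetric absmax scaling to the range [-127, 127], and the function names and list-of-lists layout are illustrative choices.

```python
def absmax_quantize(vec):
    """Quantize a vector to integers in [-127, 127] using a single absmax scale."""
    scale = (max(abs(v) for v in vec) / 127.0) or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale


def vectorwise_int8_matmul(X, W):
    """Approximate X @ W with one scale per row of X and one per column of W."""
    x_rows = [absmax_quantize(row) for row in X]
    w_cols = [absmax_quantize(col) for col in zip(*W)]
    out = []
    for xq, sx in x_rows:
        out_row = []
        for wq, sw in w_cols:
            acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
            out_row.append(acc * sx * sw)             # dequantize this inner product
        out.append(out_row)
    return out


def float_matmul(X, W):
    """Reference floating-point matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*W)] for row in X]


X = [[1.0, -2.0, 0.5], [0.1, 0.2, -0.3]]
W = [[0.5, -1.0], [0.25, 2.0], [-1.5, 0.75]]

approx = vectorwise_int8_matmul(X, W)
exact = float_matmul(X, W)
for approx_row, exact_row in zip(approx, exact):
    print([round(v, 4) for v in approx_row], exact_row)
```

Because the second row of `X` has much smaller values than the first, giving each row its own scale lets both use the full 8-bit range, which a single tensor-wide constant would not.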
Code & Models
- 🤗 ybelkada/bloom-560m-8bit (model) · 8 dl
- 🤗 ybelkada/papers (model)
- 🤗 ybelkada/bloom-1b7-8bit (model) · 461 dl · ♡ 6
- 🤗 SwastikM/Llama-2-7B-Chat-text2code (model) · 11 dl · ♡ 4
- 🤗 akameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4 (model) · 9 dl
- 🤗 YuvrajSingh9886/facebook-opt-350m-8bit-bnb (model)
- 🤗 Lyon28/caca-1M-untrained (model) · 16 dl
- 🤗 dbhavery/fineforge-qlora-pipeline (model)
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Topic Modeling
