QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos; Amirkeivan Mohtashami; Maximilian L. Croci; Bo Li; Pashmina Cameron; Martin Jaggi; Dan Alistarh; Torsten Hoefler; James Hensman

arXiv:2404.00456 · cs.LG · October 30, 2024 · 3 cites

TL;DR

QuaRot is a rotation-based quantization scheme that enables end-to-end 4-bit inference in large language models by removing outliers from the hidden state without changing the model's output. Its 6- and 8-bit LLaMa2 models are lossless without any calibration data.

Contribution

The paper presents QuaRot, a rotation-based quantization scheme that quantizes weights, activations, and KV cache to 4 bits, so that all matrix multiplications run in 4 bits with no channels retained in higher precision.

Findings

01  The 4-bit LLaMa2-70B model retains 99% of its zero-shot performance.

02  The 4-bit LLaMa2-70B model loses at most 0.47 perplexity on WikiText-2.

03  The 6- and 8-bit LLaMa2 models are lossless without any calibration data.

Abstract

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism, and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4 bits, without any channels identified for retention in higher precision. Our 4-bit quantized LLaMa2-70B model has losses of at most 0.47 WikiText-2 perplexity and retains 99% of the zero-shot performance. We also show that QuaRot can provide lossless 6 and 8 bit LLaMa2 models without any calibration data using…
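
The computational invariance described in the abstract can be illustrated on a single linear layer. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a plain Sylvester Hadamard matrix as the rotation, a symmetric per-row round-to-nearest 4-bit quantizer, and a synthetic activation matrix with one artificial outlier channel. The names hadamard, quantize_int4, x, W, and Q are all hypothetical and chosen for this illustration only.

```python
import numpy as np

def hadamard(n):
    """Sylvester Hadamard matrix of size n (a power of two), scaled to be orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x):
    """Symmetric per-row round-to-nearest onto a signed 4-bit grid (assumed quantizer)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((8, d))
x[:, 3] = 50.0                      # synthetic outlier channel (assumption)
W = rng.standard_normal((d, d)) / np.sqrt(d)

Q = hadamard(d)                     # orthogonal: Q @ Q.T == I
x_rot = x @ Q                       # rotate the activations
W_rot = Q.T @ W                     # fold the inverse rotation into the weights

# Invariance: (x Q)(Q^T W) = x W, so the full-precision output is unchanged.
y = x @ W
invariance_ok = np.allclose(y, x_rot @ W_rot)

# Quantize activations only, with and without the rotation, and compare errors.
err_plain = np.linalg.norm(quantize_int4(x) @ W - y)
err_rot = np.linalg.norm(quantize_int4(x_rot) @ W_rot - y)

print("output unchanged by rotation:", invariance_ok)
print("4-bit error without rotation:", err_plain)
print("4-bit error with rotation:   ", err_rot)
```

In this toy setting the rotation spreads the outlier channel across all coordinates, so the per-row quantization scale is no longer dominated by one large value. The sketch quantizes activations only; the scheme in the abstract also covers weights and the KV cache.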

Code & Models

Videos

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs · SlidesLive

Taxonomy

Topics: Optical Network Technologies · Advanced Wireless Communication Techniques · Error Correcting Code Techniques