Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Vladimir Malinovskii; Andrei Panferov; Ivan Ilin; Han Guo; Peter Richtárik; Dan Alistarh

arXiv:2411.17525·cs.LG·November 27, 2024

TL;DR

This paper introduces a linearity theorem linking layer-wise quantization error to language model perplexity, enabling new data-free and non-uniform quantization methods that improve efficiency and accuracy in large language models.

Contribution

The paper presents a linearity theorem for LLM quantization and develops novel data-free and non-uniform quantization techniques based on it.

Findings

1. HIGGS outperforms prior data-free quantization methods.

2. The optimal non-uniform quantization solution meets the given compression constraints.

3. Improved accuracy-compression trade-offs are demonstrated on Llama-3.1, Llama-3.2, and Qwen models.

Abstract

Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise $ℓ_{2}$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform…
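As a rough illustration of the first application named in the abstract (Hadamard rotations combined with MSE-optimal grids), the sketch below rotates one group of weights with a random-sign Hadamard transform, snaps the rotated values to a scalar grid fitted to a standard normal distribution, and rotates back. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the group size of 256, the 4-bit scalar grid, the RMS scaling, and the sample-based Lloyd iteration are all hypothetical choices made for this example, and the actual HIGGS method may differ in each of them. No calibration data is used, which is the sense in which the sketch is data-free.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def gaussian_mse_grid(bits, iters=100, samples=200_000, seed=0):
    """Approximate MSE-optimal scalar grid for N(0, 1) using Lloyd's algorithm on samples.

    Assumption for this sketch: the grid is fitted to synthetic Gaussian samples,
    never to model weights or calibration data.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(samples)
    k = 2 ** bits
    grid = np.quantile(x, (np.arange(k) + 0.5) / k)  # equal-mass initialization
    for _ in range(iters):
        edges = (grid[:-1] + grid[1:]) / 2            # nearest-neighbor cell boundaries
        idx = np.searchsorted(edges, x)
        grid = np.array([x[idx == j].mean() for j in range(k)])  # centroid update
    return grid

def quantize_group(w, grid, H, signs):
    """Rotate, scale to unit RMS, round to the nearest grid point, then undo the transform."""
    scale = np.linalg.norm(w) / np.sqrt(w.size)       # assumes w is not all zeros
    z = H @ (signs * w) / scale
    edges = (grid[:-1] + grid[1:]) / 2
    q = grid[np.searchsorted(edges, z)]
    return signs * (H.T @ (q * scale))                # H is orthogonal, signs are +/-1

def relative_l2_error(w, w_hat):
    return np.linalg.norm(w - w_hat) ** 2 / np.linalg.norm(w) ** 2

rng = np.random.default_rng(0)
n = 256                                               # hypothetical group size
w = rng.standard_t(df=3, size=n)                      # heavy-tailed stand-in for a weight group
H = hadamard(n)
signs = rng.choice([-1.0, 1.0], size=n)
grid = gaussian_mse_grid(bits=4)

w_hat = quantize_group(w, grid, H, signs)
rel_err = relative_l2_error(w, w_hat)

# Baseline without rotation: identity transform, same grid and scaling.
w_plain = quantize_group(w, grid, np.eye(n), np.ones(n))
rel_err_plain = relative_l2_error(w, w_plain)

print(f"relative l2 error with rotation:    {rel_err:.4f}")
print(f"relative l2 error without rotation: {rel_err_plain:.4f}")
```

The quantity printed here is the relative layer-wise $ℓ_{2}$ reconstruction error, which is the kind of per-layer error the abstract relates to the perplexity increase. The rotation is included because mixing the entries makes them look closer to Gaussian, so a grid fitted to a Gaussian is a better match than it would be for the raw heavy-tailed values.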

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus