Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh

TL;DR
This paper introduces a linearity theorem linking layer-wise quantization error to language model perplexity, enabling new data-free and non-uniform quantization methods that improve efficiency and accuracy in large language models.
Contribution
The paper presents a linearity theorem for LLM quantization and develops novel data-free and non-uniform quantization techniques based on it.
Findings
HIGGS outperforms prior data-free quantization methods, including the popular NF4 format.
The paper gives an optimal solution for choosing non-uniform quantization under a given compression constraint.
Improved accuracy-compression trade-offs are demonstrated on Llama-3.1, Llama-3.2, and Qwen models.
Abstract
Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform…
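The data-free recipe named in the abstract (a Hadamard rotation followed by rounding to an MSE-optimal grid) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`hadamard`, `lloyd_max_grid`, `quantize_rotated`, `relative_error`), the scalar (one-dimensional) grid, the per-row scaling, and the use of Lloyd's algorithm on Gaussian samples to fit the grid are all assumptions made for illustration.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_grid(samples, levels, iters=50):
    """Fit a 1-D grid that locally minimizes MSE on `samples` (Lloyd's algorithm)."""
    # Start from evenly spaced quantiles so every cell is populated.
    grid = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Assign each sample to its nearest grid point, then move points to cell means.
        idx = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                grid[k] = samples[idx == k].mean()
    return np.sort(grid)

def quantize_rotated(W, grid, seed=0):
    """Rotate rows of W with a random-sign Hadamard transform, round to `grid`, rotate back."""
    n = W.shape[1]
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    Q = hadamard(n) * signs                    # H @ diag(signs), still orthogonal
    R = W @ Q                                  # rotated weights look closer to Gaussian
    scale = R.std(axis=1, keepdims=True)       # per-row scale (an assumption of this sketch)
    idx = np.abs((R / scale)[..., None] - grid).argmin(axis=-1)
    R_hat = grid[idx] * scale                  # dequantized rotated weights
    return R_hat @ Q.T                         # undo the rotation

def relative_error(W, W_hat):
    """Relative squared reconstruction error ||W_hat - W||_F^2 / ||W||_F^2 of one layer."""
    return np.sum((W_hat - W) ** 2) / np.sum(W ** 2)
```

For example, fitting a 16-level (4-bit) grid on standard normal samples with `lloyd_max_grid(rng.standard_normal(20000), levels=16)` and passing it to `quantize_rotated` yields a reconstruction whose `relative_error` is the kind of layer-wise quantity the abstract relates to the perplexity increase.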
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
