QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa

TL;DR
This paper introduces QuIP, a 2-bit post-training quantization method for large language models that combines adaptive rounding with incoherence processing, supported by theoretical analysis and empirical results.
Contribution
We propose QuIP, a new 2-bit quantization technique with incoherence processing, and provide the first theoretical analysis for LLM-scale quantization algorithms.
Findings
Incoherence preprocessing improves existing quantization methods.
QuIP achieves viable 2-bit quantization results for large language models.
Theoretical analysis applies to QuIP and existing methods like OPTQ.
Abstract
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several…
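The two steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: the quadratic proxy is taken to be tr((Ŵ − W) H (Ŵ − W)ᵀ), the random orthogonal matrices are dense and drawn via QR decomposition (the abstract only says the processing is "efficient", so the real method presumably uses something cheaper), and plain nearest rounding stands in for the adaptive rounding procedure. All function names here are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix yields an orthogonal Q; the sign fix
    # removes the ambiguity in the factorization.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def proxy_loss(w_hat, w, h):
    # Assumed quadratic proxy: tr((W_hat - W) H (W_hat - W)^T).
    d = w_hat - w
    return float(np.trace(d @ h @ d.T))

def round_to_grid(w, bits=2):
    # Stand-in for adaptive rounding: nearest rounding onto a uniform
    # grid of 2**bits levels spanning [min(w), max(w)].
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo

def quantize_with_incoherence(w, h, bits=2, seed=0):
    rng = np.random.default_rng(seed)
    m, n = w.shape
    u = random_orthogonal(m, rng)
    v = random_orthogonal(n, rng)
    # Pre-processing: rotate weights and Hessian into a random basis.
    w_t = u @ w @ v.T
    h_t = v @ h @ v.T
    # Quantize in the rotated basis.
    w_t_hat = round_to_grid(w_t, bits)
    # Post-processing: rotate the quantized weights back.
    w_hat = u.T @ w_t_hat @ v
    return w_hat, (w_t, h_t, w_t_hat)
```

Because the rotations are orthogonal, the proxy loss measured in the rotated basis equals the proxy loss of the de-rotated weights against the original `w` and `h`, so any rounding method can be applied to `w_t` and `h_t` without changing the objective being minimized.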
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
