QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa

TL;DR
QuIP# is a novel post-training quantization method for large language models that employs Hadamard transforms, lattice codebooks, and fine-tuning to achieve state-of-the-art compression at extremely low bit-widths, enabling efficient and accurate inference.
Contribution
Introduces QuIP#, a PTQ technique combining Hadamard incoherence, E8 lattice codebooks, and fine-tuning for superior LLM weight compression.
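As a rough illustration of the "Hadamard incoherence" idea, the sketch below applies a randomized Hadamard transform to a single vector: flip each coordinate by a random sign, then multiply by a normalized Hadamard matrix. It is a minimal pure-Python sketch, not the authors' implementation. The function names (`fwht`, `rht`, `rht_inverse`), the seed, and the toy input are all assumptions made for illustration, and the method itself works on weight matrices rather than on one vector.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def random_signs(n, seed=0):
    """Random +/-1 sign vector (hypothetical helper; the seed is arbitrary)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(x, signs):
    """Map x to H (signs * x) / sqrt(n), which is an orthogonal transform."""
    scale = 1.0 / math.sqrt(len(x))
    return [v * scale for v in fwht([s * xi for s, xi in zip(signs, x)])]


def rht_inverse(y, signs):
    """Invert rht: H is symmetric with H H = n I, so apply H / sqrt(n) again, then undo the signs."""
    scale = 1.0 / math.sqrt(len(y))
    return [s * v * scale for s, v in zip(signs, fwht(y))]


# Toy input: one large outlier, which is hard to quantize with few bits.
x = [0.0] * 64
x[5] = 8.0
signs = random_signs(len(x))
y = rht(x, signs)
x_back = rht_inverse(y, signs)
```

In this toy case the single outlier of magnitude 8 is spread into 64 entries of magnitude 1, the Euclidean norm is unchanged, and `rht_inverse` recovers the original vector, so the transform can be undone after quantization.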
Findings
Outperforms existing PTQ methods in extreme compression regimes.
Enables fast inference with minimal accuracy loss.
Enables new behaviors in PTQ scaling.
Abstract
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes (≤ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric E8 lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms…
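To make the lattice idea concrete, the sketch below rounds an 8-dimensional vector to its nearest point in the infinite E8 lattice, using the standard description of E8 as the union of D8 (integer vectors with an even coordinate sum) and D8 shifted by 1/2 in every coordinate. This is only an illustrative decoder under that assumption. The paper's codebooks are finite, hardware-efficient constructions based on this lattice, and the function names here (`nearest_d8`, `nearest_e8`, `sq_dist`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [float(round(v)) for v in x]
    if int(sum(r)) % 2 != 0:
        # Parity is wrong: take the coordinate with the largest rounding
        # error and round it the other way, which is the cheapest fix.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1.0 if x[k] >= r[k] else -1.0
    return r


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2), for an 8-dimensional input."""
    if len(x) != 8:
        raise ValueError("E8 points have exactly 8 coordinates")
    a = nearest_d8(x)                                       # candidate in D8
    b = [v + 0.5 for v in nearest_d8([u - 0.5 for u in x])]  # candidate in D8 + 1/2
    return a if sq_dist(x, a) <= sq_dist(x, b) else b


# Quantizing 8 weights at a time instead of one at a time:
weights = [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
quantized = nearest_e8(weights)
```

Here the eight weights are rounded jointly to the half-integer point with 0.5 in every coordinate, which is closer than the all-zero point that independent rounding of each coordinate would give.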
Taxonomy
Topics: Cryptography and Data Security
Methods: Sparse Evolutionary Training
