GPTVQ: The Blessing of Dimensionality for LLM Quantization

Mart van Baalen; Andrey Kuzmin; Ivan Koryakovskiy; Markus Nagel; Peter Couperus; Cedric Bastoul; Eric Mahurin; Tijmen Blankevoort; Paul Whatmough

arXiv:2402.15319 · cs.LG · June 4, 2025 · 2 cites

TL;DR

GPTVQ introduces a novel high-dimensional vector quantization method that significantly improves the size-accuracy trade-off for large language models, enabling efficient post-training quantization with state-of-the-art results.

Contribution

The paper presents GPTVQ, a fast, scalable post-training quantization method leveraging high-dimensional vector quantization and Hessian-based updates for large language models.

Findings

01. Achieves new state-of-the-art size-accuracy trade-offs on Llama-v2 and Mistral models.

02. Processes large models in 3 to 11 hours on a single H100 GPU.

03. VQ decoding on mobile CPUs offers latency improvements over 4-bit integer formats.

Abstract

In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it…
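To make the idea of vector quantization concrete, the following is a minimal sketch of the basic mechanism only: groups of weights are replaced by indices into a small shared codebook. It is not the GPTVQ algorithm. Plain k-means stands in for the data-aware EM initialization described in the abstract, and the Hessian-based weight updates and codebook compression are omitted. The function names (`kmeans_codebook`, `vq_quantize`, `vq_dequantize`) and the parameter choices (`dim=2`, `k=16`) are assumptions made for illustration.

```python
import numpy as np

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Fit k centroids with plain k-means (a stand-in, not data-aware EM)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every vector to every centroid: shape (n, k).
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def vq_quantize(weights, dim=2, k=16):
    """Split each row into dim-sized groups and map each group to a codebook index."""
    rows, cols = weights.shape
    assert cols % dim == 0, "column count must be divisible by the VQ dimension"
    vectors = weights.reshape(-1, dim)
    codebook = kmeans_codebook(vectors, k)
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1).reshape(rows, cols // dim)
    return codebook, indices

def vq_dequantize(codebook, indices):
    """Look up each index in the codebook and restore the original matrix shape."""
    rows, groups = indices.shape
    return codebook[indices].reshape(rows, groups * codebook.shape[1])

# Toy example on random weights (hypothetical data, not a real model layer).
W = np.random.default_rng(1).standard_normal((64, 32))
codebook, idx = vq_quantize(W, dim=2, k=16)
W_hat = vq_dequantize(codebook, idx)
print("reconstruction MSE:", np.mean((W - W_hat) ** 2))
```

With 16 centroids shared by pairs of weights, each index costs log2(16) = 4 bits and covers 2 weights, so the index cost is 2 bits per weight before counting the codebook itself. Raising `dim` while holding the index bits per weight fixed requires a larger codebook, which is one reason codebook storage overhead matters for VQ.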

Taxonomy

Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Atomic and Subatomic Physics Research