FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

Namyoon Lee; Yongjune Kim

arXiv:2605.11478 · cs.AI · May 13, 2026


TL;DR

FibQuant is a universal fixed-rate vector quantizer for random-access KV-cache compression in large language model inference. It reports up to 34x compression at 0.95 cosine similarity on GPT-2 small KV caches and stays within 0.10 perplexity of fp16 at 4x compression on TinyLlama-1.1B.

Contribution

The paper proposes a fixed-rate vector quantizer with a shared radial-angular codebook. It keeps the normalize-rotate-store interface of rotation-based KV-cache codecs and replaces their per-coordinate scalar tables.

Findings

01. Achieves up to 34x compression at 0.95 cosine similarity on GPT-2 small KV caches.

02. Stays within 0.10 perplexity of fp16 at 4x compression on TinyLlama-1.1B.

03. Outperforms scalar TurboQuant at 8x compression, with 3.6x lower perplexity.
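To put these compression ratios on a common scale, the minimal sketch below converts a ratio into an approximate bit budget per coordinate. It assumes every ratio is measured against 16-bit (fp16) storage and ignores per-vector overhead such as a stored norm. The source names fp16 as the baseline only for the TinyLlama result, so treat the other figures as illustrative. The function name `bits_per_coordinate` is hypothetical.

```python
FP16_BITS = 16  # assumed baseline width; the page names fp16 only for TinyLlama


def bits_per_coordinate(compression_ratio: float, baseline_bits: int = FP16_BITS) -> float:
    """Approximate bits per stored coordinate, ignoring side information."""
    return baseline_bits / compression_ratio


# Ratios quoted in the findings list.
for ratio in (4, 8, 34):
    print(f"{ratio}x compression -> about {bits_per_coordinate(ratio):.2f} bits per coordinate")
```

Under these assumptions, 34x corresponds to less than one bit per coordinate. A code that spends at least one bit on every coordinate cannot reach that budget, which is consistent with the move from scalar tables to a vector codebook.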

Abstract

Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize-rotate-store interface while replacing scalar tables by a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii,…
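The geometric claim in the abstract can be checked numerically. The sketch below is not the paper's implementation. It normalizes a vector, applies a random rotation, and measures the squared radius of the first $k$ rotated coordinates. It relies on the standard fact that, for a uniformly random point on the unit sphere in $d$ dimensions, the squared norm of any $k$ coordinates follows a Beta($k/2$, $(d-k)/2$) distribution with mean $k/d$. Reading "spherical-Beta" this way is an assumption. The helper names (`random_rotation`, `normalize_rotate`, `block_squared_radius`) and the sizes `d = 16`, `k = 4` are hypothetical. A fresh rotation is drawn on each trial only to sample the distribution; a codec would share one rotation.

```python
import math
import random


def random_rotation(d, rng):
    """Orthogonal d x d matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def normalize_rotate(x, rotation):
    """Return (norm, rotated unit vector) for one key or value vector."""
    norm = math.sqrt(sum(a * a for a in x))
    unit = [a / norm for a in x]
    return norm, [sum(q * a for q, a in zip(row, unit)) for row in rotation]


def block_squared_radius(rotated, k, start=0):
    """Squared Euclidean norm of k consecutive rotated coordinates."""
    return sum(c * c for c in rotated[start:start + k])


d, k, trials = 16, 4, 500          # illustrative sizes, not from the paper
rng = random.Random(0)
x = [10.0] + [0.0] * (d - 1)       # deliberately non-isotropic input

radii = []
for _ in range(trials):
    _, rotated = normalize_rotate(x, random_rotation(d, rng))
    radii.append(block_squared_radius(rotated, k))

mean_r2 = sum(radii) / trials
print(f"empirical mean squared radius: {mean_r2:.3f}  (Beta mean k/d = {k / d:.3f})")
```

Every block lands inside the unit ball, and its radius is coupled to the other coordinates through the unit-norm constraint. This coupling is the geometry that the abstract says a coordinate-by-coordinate scalar code discards.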
