FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
Namyoon Lee, Yongjune Kim

TL;DR
FibQuant is a universal fixed-rate vector quantizer for random-access KV-cache compression. It reduces KV-cache memory in large language model inference while keeping reconstruction fidelity high.
Contribution
It proposes a novel fixed-rate vector quantizer with a shared radial-angular codebook, improving upon scalar quantization for KV-cache compression.
Findings
Achieves up to 34x compression with 0.95 cosine similarity on GPT-2 small KV caches.
Within 0.10 perplexity of fp16 at 4x compression on TinyLlama-1.1B.
Outperforms scalar TurboQuant at 8x compression with 3.6x lower perplexity.
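To put the compression ratios above in terms of bit budget, the short sketch below converts a ratio into average bits per stored coordinate. It assumes the ratios are measured against 16-bit (fp16) storage and ignores per-vector side information such as a stored norm; this summary states neither for every figure, so treat the numbers as rough. The function name `bits_per_coordinate` is hypothetical.

```python
def bits_per_coordinate(ratio: float, baseline_bits: float = 16.0) -> float:
    """Average bits per coordinate at a given compression ratio.

    Assumption (not from the paper): the baseline is 16-bit storage and
    side information such as a per-vector norm is ignored.
    """
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return baseline_bits / ratio


if __name__ == "__main__":
    for ratio in (4, 8, 34):
        print(f"{ratio}x -> {bits_per_coordinate(ratio):.2f} bits/coordinate")
```

Under these assumptions, 34x corresponds to under half a bit per coordinate. A fixed-rate scalar codec that spends at least one bit on every coordinate cannot reach that rate, which is one reason to code blocks of coordinates jointly.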
Abstract
Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize-rotate-store interface while replacing scalar tables by a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii,…
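The normalize-rotate-store interface and the spherical-Beta claim can be illustrated with a minimal NumPy sketch. This is not the FibQuant codebook and not any published codec. The scalar baseline uses a uniform grid clipped at 3/sqrt(d), which is an assumption made purely for illustration, and all function names are hypothetical. The second part checks a standard fact: for a vector uniform on the unit sphere in R^d, the squared norm of any k coordinates follows Beta(k/2, (d-k)/2), so its mean is k/d.

```python
import numpy as np


def haar_rotation(d, rng):
    """Draw a d x d orthogonal matrix from the Haar measure via QR."""
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    # Rescale columns by the signs of diag(R) so the law is exactly Haar.
    return q * np.sign(np.diag(r))


def scalar_encode(x, rot, bits):
    """Baseline sketch: store the norm, rotate, quantize each coordinate."""
    d = x.shape[0]
    norm = np.linalg.norm(x)
    u = rot @ (x / norm)              # unit vector in the rotated basis
    c = 3.0 / np.sqrt(d)              # assumed clipping range (illustrative)
    step = 2.0 * c / (2 ** bits)
    idx = np.clip(np.floor((u + c) / step), 0, 2 ** bits - 1)
    return norm, idx.astype(np.int64)


def scalar_decode(norm, idx, rot, bits):
    """Invert scalar_encode using bin midpoints and the inverse rotation."""
    d = idx.shape[0]
    c = 3.0 / np.sqrt(d)
    step = 2.0 * c / (2 ** bits)
    u_hat = -c + (idx + 0.5) * step
    return norm * (rot.T @ u_hat)


def block_sq_radii(d, k, n, rng):
    """Squared norm of the first k coordinates of n uniform unit vectors.

    A Haar-rotated unit vector is uniform on the sphere, so sampling
    normalized Gaussians directly is equivalent.
    """
    x = rng.standard_normal((n, d))
    u = x / np.linalg.norm(x, axis=1, keepdims=True)
    return np.sum(u[:, :k] ** 2, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, bits = 64, 4, 4
    rot = haar_rotation(d, rng)
    x = rng.standard_normal(d)
    norm, idx = scalar_encode(x, rot, bits)
    x_hat = scalar_decode(norm, idx, rot, bits)
    cos = float(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat)))
    print("cosine similarity of scalar baseline:", cos)
    r2 = block_sq_radii(d, k, 20000, rng)
    print("empirical mean of block squared radius:", r2.mean(), "vs k/d =", k / d)
```

Because the block's squared radius concentrates around k/d rather than being free, the k coordinates in a block are statistically dependent. That dependence is the geometry a per-coordinate scalar table ignores and a joint radial-angular codebook can exploit.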
