eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
Pei-Chun Su

TL;DR
eOptShrinkQ is a novel two-stage compression method for transformer KV caches that combines spectral denoising and quantization, achieving near-lossless compression and improved retrieval performance.
Contribution
The paper introduces a compression pipeline based on spectral denoising, with theoretical grounding in the spiked random matrix model, that outperforms TurboQuant alone on transformer KV caches.
Findings
eOptShrinkQ saves nearly one bit per entry over TurboQuant at similar quality.
It outperforms TurboQuant at 2.2 bits per entry on LongBench tasks.
Spectral denoising acts as a regularizer, improving performance on retrieval tasks.
Abstract
We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matrix model. This observation leads to eOptShrinkQ, a two-stage compression pipeline: optimal singular value shrinkage (eOptShrink) automatically extracts the shared structure, and the residual -- which satisfies the \emph{thin shell property} with delocalized coordinates -- is quantized by TurboQuant~\citep{zandieh2025turboquant}, a recently proposed per-vector scalar quantizer with near-optimal distortion guarantees. By restoring the isotropy that scalar quantization assumes, spectral denoising eliminates the need for both outlier handling and dedicated inner product bias correction, freeing those bits for improved reconstruction. The theoretical grounding in…
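The two-stage idea in the abstract (split off a low-rank shared component, then scalar-quantize the residual) can be illustrated with a minimal NumPy sketch. This is not the paper's method: hard truncated SVD with a hand-picked `rank` stands in for eOptShrink's optimal shrinkage, and a uniform per-row min/max quantizer stands in for TurboQuant. All function names, the synthetic data, and the parameter values are assumptions made for illustration only.

```python
import numpy as np

def split_low_rank(K, rank):
    """Split K into a rank-`rank` part and a residual via truncated SVD.
    Assumption: hard truncation with a hand-picked rank, NOT eOptShrink."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return low_rank, K - low_rank

def quantize_rows(R, bits):
    """Uniform per-row scalar quantizer (a simple stand-in, NOT TurboQuant).
    Requires bits <= 8 because codes are stored as uint8."""
    levels = 2 ** bits
    lo = R.min(axis=1, keepdims=True)
    hi = R.max(axis=1, keepdims=True)
    # Guard against constant rows, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / (levels - 1), 1.0)
    codes = np.round((R - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_rows(codes, lo, scale):
    """Map integer codes back to approximate real values."""
    return codes * scale + lo

def compress_decompress(K, rank, bits):
    """Two-stage round trip: low-rank part kept exactly, residual quantized."""
    low_rank, residual = split_low_rank(K, rank)
    codes, lo, scale = quantize_rows(residual, bits)
    return low_rank + dequantize_rows(codes, lo, scale)

def rel_error(K, K_hat):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)

def demo(seed=0, n_tokens=256, head_dim=64, rank=4, bits=3):
    """Compare direct quantization with the two-stage sketch on synthetic data."""
    rng = np.random.default_rng(seed)
    # Synthetic "spiked" matrix: strong low-rank signal plus i.i.d. noise.
    shared = 3.0 * rng.normal(size=(n_tokens, rank)) @ rng.normal(size=(rank, head_dim))
    K = shared + rng.normal(size=(n_tokens, head_dim))
    codes, lo, scale = quantize_rows(K, bits)
    direct = dequantize_rows(codes, lo, scale)
    two_stage = compress_decompress(K, rank, bits)
    return rel_error(K, direct), rel_error(K, two_stage)

if __name__ == "__main__":
    direct_err, two_stage_err = demo()
    print(f"direct: {direct_err:.4f}  two-stage: {two_stage_err:.4f}")
```

On this synthetic input the residual has a much narrower per-row range than the full matrix, so the same number of quantization levels covers it more finely. The sketch ignores the storage cost of the low-rank factors and of the per-row `lo` and `scale` values, so it says nothing about the bits-per-entry figures reported in the paper.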
