TL;DR
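OCTOPUS is a KV-cache codec for transformers that jointly quantizes rotated coordinates in triplets. Each triplet's direction is mapped to a square by an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized to minimize squared error. It is reported to match or exceed prior rotation codecs while reducing decode-time bandwidth and latency.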
Contribution
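A structured quantization scheme for KV-cache compression in transformers: rotated coordinates are grouped into triplets, each triplet's direction is encoded with an octahedral parameterization, and bits are allocated non-uniformly by minimizing the per-triplet squared error.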
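To make the octahedral parameterization concrete, here is a minimal sketch of the standard octahedral mapping between a 3D vector and a point in the square [-1, 1]^2 plus a norm. It is a generic illustration, not the paper's implementation: the function names `oct_encode` and `oct_decode` are hypothetical, and the exact convention OCTOPUS uses (range of the square, handling of the lower hemisphere) is not stated in this summary.

```python
import numpy as np


def _sign_not_zero(v):
    # Like np.sign, but maps 0 to +1 so the fold below never collapses a coordinate.
    return np.where(v >= 0.0, 1.0, -1.0)


def oct_encode(v):
    """Map a nonzero 3-vector to (point in [-1, 1]^2, Euclidean norm)."""
    v = np.asarray(v, dtype=np.float64)
    r = np.linalg.norm(v)
    # Project onto the octahedron |x| + |y| + |z| = 1.
    n = v / np.abs(v).sum()
    p = n[:2]
    if n[2] < 0.0:
        # Fold the lower half of the octahedron into the corners of the square.
        p = (1.0 - np.abs(p[::-1])) * _sign_not_zero(p)
    return p, r


def oct_decode(p, r):
    """Inverse of oct_encode: recover the 3-vector from the square point and norm."""
    p = np.asarray(p, dtype=np.float64)
    z = 1.0 - np.abs(p).sum()
    xy = p
    if z < 0.0:
        # Undo the fold for points that came from the lower half.
        xy = (1.0 - np.abs(p[::-1])) * _sign_not_zero(p)
    d = np.array([xy[0], xy[1], z])
    return r * d / np.linalg.norm(d)
```

Without quantization the round trip is exact up to floating-point error. A codec in this style would quantize the two square coordinates and the norm before storing them, so the direction error comes only from the quantizer, not from the mapping.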
Findings
Matches or exceeds prior rotation-preconditioned codecs at every bit width and on every metric reported.
Provides a data-oblivious, online, and deterministic codec.
Reduces decode-time bandwidth and latency by reconstructing keys on the fly.
Abstract
The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is…
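The Lloyd-Max quantization named in the abstract can be illustrated with a small empirical sketch. The code below fits a squared-error-optimal scalar codebook to samples by Lloyd's iteration (alternating nearest-level assignment and centroid updates). It is an assumption-laden stand-in: the paper fits its quantizers against "implementation-matched marginals" whose form is not given here, and the names `lloyd_max` and `quantize` are hypothetical.

```python
import numpy as np


def lloyd_max(samples, bits, iters=50):
    """Fit a 2**bits-level scalar quantizer that locally minimizes squared error."""
    x = np.asarray(samples, dtype=np.float64)
    k = 2 ** bits
    # Initialize reconstruction levels at evenly spaced sample quantiles.
    levels = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Decision boundaries sit midway between neighboring levels.
        edges = 0.5 * (levels[:-1] + levels[1:])
        idx = np.searchsorted(edges, x)
        # Move each level to the mean of the samples assigned to it.
        for j in range(k):
            cell = x[idx == j]
            if cell.size:
                levels[j] = cell.mean()
    edges = 0.5 * (levels[:-1] + levels[1:])
    return levels, edges


def quantize(x, levels, edges):
    """Return (integer codes, reconstructed values) for x under the fitted quantizer."""
    idx = np.searchsorted(edges, np.asarray(x, dtype=np.float64))
    return idx, levels[idx]
```

In a codec of the kind the abstract describes, one such codebook would be fitted per quantized quantity (each square coordinate and the triplet norm), with the number of bits per quantity set by the bit allocation.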
