OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Mark Boss; Vikram Voleti; Simon Donné; Shimon Vainer

arXiv:2605.21226 · cs.LG · May 21, 2026

TL;DR

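OCTOPUS is a KV-cache codec for transformers that jointly quantizes rotated coordinate triplets. Each triplet's direction is mapped to a square by an octahedral parametrization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized. It matches or exceeds prior rotation codecs while reducing decode-time bandwidth and latency.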

Contribution

The paper develops a structured quantization scheme for KV-cache compression: rotated coordinates are quantized jointly in triplets via an octahedral parametrization, with the bit allocation chosen to minimize the per-triplet squared error.
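
The abstract names Lloyd-Max quantization as the scalar quantizer behind the squared-error minimization. As a rough illustration only, the sketch below fits a Lloyd-Max codebook to samples using the classic alternating update (thresholds at midpoints between levels, levels at cell means). It is not the authors' implementation: the paper fits against "implementation-matched marginals", whereas this sketch assumes a standard Gaussian stand-in, and the names `lloyd_max` and `quantize` are hypothetical.

```python
import random


def lloyd_max(samples, levels, iters=60):
    """Fit a scalar codebook that locally minimizes mean squared error."""
    xs = sorted(samples)
    n = len(xs)
    # Initialize reconstruction levels at evenly spaced sample quantiles.
    codebook = [xs[int((i + 0.5) * n / levels)] for i in range(levels)]
    for _ in range(iters):
        # Nearest-neighbor condition: cell boundaries are level midpoints.
        thresholds = [(a + b) / 2.0 for a, b in zip(codebook, codebook[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while k < levels - 1 and x > thresholds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Centroid condition: each level becomes the mean of its cell.
        codebook = [
            sums[i] / counts[i] if counts[i] else codebook[i]
            for i in range(levels)
        ]
    return codebook


def quantize(x, codebook):
    """Return the index of the reconstruction level nearest to x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))


# Assumed stand-in marginal: a standard Gaussian, 2 bits (4 levels).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
codebook = lloyd_max(samples, levels=4)
```

A stored value is then just the index returned by `quantize`, and reconstruction is a table lookup into `codebook`.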

Findings

01. Matches or exceeds prior rotation codecs at all bit widths and metrics.

02. Provides a data-oblivious, online, and deterministic codec.

03. Reduces decode-time bandwidth and latency by reconstructing keys on the fly.

Abstract

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parametrization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. Using sweeps, we find the finite-dimensional quality optimum to be constant on every real decoder we test. The codec is…
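
To make the triplet step concrete, here is a minimal sketch of the standard octahedral mapping from a 3D direction to a point in the square [-1, 1]^2, together with its inverse. It assumes the common formulation (L1-normalize onto the octahedron, then fold the lower half outward); the paper's exact variant, its rotation step, and its quantizers are not shown, and the function names (`oct_encode`, `oct_decode`, `encode_triplet`, `decode_triplet`) are hypothetical.

```python
import math


def _sign(t):
    """Sign that treats zero as positive, so the fold stays invertible."""
    return 1.0 if t >= 0.0 else -1.0


def oct_encode(v):
    """Map the direction of a nonzero 3-vector to (u, w) in [-1, 1]^2."""
    x, y, z = v
    l1 = abs(x) + abs(y) + abs(z)
    if l1 == 0.0:
        raise ValueError("direction of the zero vector is undefined")
    # Project onto the octahedron |x| + |y| + |z| = 1.
    u, w = x / l1, y / l1
    if z < 0.0:
        # Fold the lower hemisphere into the corners of the square.
        u, w = (1.0 - abs(w)) * _sign(u), (1.0 - abs(u)) * _sign(w)
    return u, w


def oct_decode(u, w):
    """Invert oct_encode, returning a unit-length 3-vector."""
    z = 1.0 - abs(u) - abs(w)
    if z < 0.0:
        # Undo the fold for points outside the inner diamond.
        x = (1.0 - abs(w)) * _sign(u)
        y = (1.0 - abs(u)) * _sign(w)
    else:
        x, y = u, w
    n = math.sqrt(x * x + y * y + z * z)
    return x / n, y / n, z / n


def encode_triplet(v):
    """Split a triplet into its Euclidean norm and octahedral coordinates."""
    r = math.sqrt(sum(c * c for c in v))
    u, w = oct_encode(v)
    return r, u, w


def decode_triplet(r, u, w):
    """Rebuild the triplet from its norm and octahedral coordinates."""
    x, y, z = oct_decode(u, w)
    return r * x, r * y, r * z
```

In a codec along the lines the abstract describes, `r`, `u`, and `w` would each be passed through a scalar quantizer before storage; here they are left unquantized, so `decode_triplet(*encode_triplet(v))` recovers `v` up to floating-point error.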

Code & Models

Repositories

https://octopus-quant.github.io
