Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Luca Benfenati; Matteo Risso; Andrea Vannozzi; Ahmet Caner Yüzügüler; Lukas Cavigelli; Enrico Macii; Daniele Jahier Pagliari; Alessio Burrello

arXiv:2601.21686·cs.LG·January 30, 2026

TL;DR

StiefAttention is a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error, and it outperforms EigenAttention at the same compression level.

Contribution

The paper proposes a post-training compression technique that optimizes orthonormal projection bases against the end-to-end reconstruction error of the decoder-layer output.

Findings

01 Outperforms EigenAttention by 11.9 points on C4 perplexity

02 Achieves 5.4% higher 0-shot MMLU accuracy at the same compression level

03 Provides flexible layer-wise rank allocation under an error budget
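
One way to picture rank allocation under an error budget is the sketch below. It assumes each layer already has a precomputed error-rank profile (a mapping from candidate rank to reconstruction error) and picks, per layer, the smallest rank whose error fits a single shared budget. The function name `allocate_ranks`, the profile format, and the uniform per-layer threshold are all assumptions made for illustration; the paper's actual allocation procedure may differ.

```python
def allocate_ranks(profiles, budget):
    """Pick one rank per layer from precomputed error-rank profiles.

    profiles: list with one dict per layer, mapping candidate rank -> error.
    budget:   maximum tolerated error per layer (hypothetical uniform budget).
    Returns the smallest feasible rank per layer; if no candidate meets the
    budget, falls back to the largest candidate rank for that layer.
    """
    chosen = []
    for profile in profiles:
        # Candidate ranks in increasing order that satisfy the budget.
        feasible = [r for r in sorted(profile) if profile[r] <= budget]
        chosen.append(feasible[0] if feasible else max(profile))
    return chosen


# Illustrative (made-up) profiles for two layers.
example_profiles = [
    {16: 0.30, 32: 0.12, 64: 0.02},  # layer that needs a high rank
    {16: 0.08, 32: 0.03, 64: 0.01},  # layer that tolerates a low rank
]
print(allocate_ranks(example_profiles, budget=0.10))  # [64, 16]
```

Because the profiles are computed once, changing the budget only repeats this cheap lookup, which is what makes the allocation flexible.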

Abstract

Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank and storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank…
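
The basic mechanics of low-rank KV-cache projection can be sketched in a few lines of NumPy. This example fits an orthonormal basis with a truncated SVD of the keys themselves, which is the SVD-style proxy fitting the abstract contrasts against, not StiefAttention's learned bases. All names (`fit_orthonormal_basis`, `compress`, `reconstruct`) and the random stand-in data are assumptions for illustration only.

```python
import numpy as np


def fit_orthonormal_basis(calib, rank):
    """Return a (head_dim, rank) matrix with orthonormal columns.

    Uses the top right-singular vectors of the calibration matrix; this is
    an SVD-style proxy fit, shown here only as a stand-in for a learned basis.
    """
    _, _, vt = np.linalg.svd(calib, full_matrices=False)
    return vt[:rank].T


def compress(x, basis):
    # Project (num_tokens, head_dim) onto the basis: only this is cached.
    return x @ basis


def reconstruct(z, basis):
    # Map cached (num_tokens, rank) coefficients back to head_dim.
    return z @ basis.T


def relative_error(x, x_hat):
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))


rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))   # stand-in for one head's keys
basis = fit_orthonormal_basis(keys, rank=16)
cached = compress(keys, basis)          # shape (256, 16): 4x fewer values
approx = reconstruct(cached, basis)
print(cached.shape, relative_error(keys, approx))
```

Orthonormal columns mean `basis.T @ basis` is the identity, so reconstruction is a plain transpose with no extra solve. Note that the error measured here is on the keys alone; the abstract's point is that such a proxy can differ from the error seen at the decoder-layer output.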

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Parallel Computing and Optimization Techniques