Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello

TL;DR
StiefAttention introduces a novel method for KV-cache compression that learns orthonormal projections by directly minimizing decoder-layer output errors, improving performance over existing approaches.
Contribution
It proposes a new post-training compression technique that optimizes orthonormal projection bases based on end-to-end decoder output reconstruction error.
Findings
Outperforms EigenAttention by 11.9 points on C4 perplexity
Achieves 5.4% higher 0-shot MMLU accuracy at the same compression level
Provides flexible layer-wise rank allocation under an error budget
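As a rough illustration of the last finding, layer-wise rank allocation under an error budget could look like the sketch below. The selection rule (smallest candidate rank whose profiled error fits the budget), the function name `allocate_ranks`, and all error values are assumptions made up for this example. The summary does not specify how the paper performs the allocation.

```python
def allocate_ranks(profiles, budget):
    """Pick, for each layer, the smallest candidate rank whose error fits the budget.

    profiles: one dict per layer mapping candidate rank -> reconstruction error.
    Falls back to the largest candidate rank when no rank meets the budget.
    This greedy rule is an assumption, not the paper's stated procedure.
    """
    ranks = []
    for profile in profiles:
        feasible = [r for r, err in sorted(profile.items()) if err <= budget]
        ranks.append(feasible[0] if feasible else max(profile))
    return ranks


# Hypothetical error-rank profiles for three layers (illustrative numbers only).
profiles = [
    {16: 0.30, 32: 0.12, 48: 0.05, 64: 0.0},
    {16: 0.08, 32: 0.03, 48: 0.01, 64: 0.0},
    {16: 0.50, 32: 0.25, 48: 0.11, 64: 0.0},
]
ranks = allocate_ranks(profiles, budget=0.10)
print(ranks)
```

Because the profiles are precomputed once per layer, changing the budget only requires rerunning this cheap lookup, not refitting any projections.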
Abstract
Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank and storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank…
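To make the low-rank projection idea concrete, the following sketch compresses one head's key matrix with an orthonormal basis taken from a truncated SVD. This corresponds to the SVD-style proxy objective that the abstract attributes to existing approaches, not to StiefAttention's learned bases. The shapes, the synthetic data, and the helper name `orthonormal_basis` are assumptions chosen for illustration.

```python
import numpy as np


def orthonormal_basis(X, rank):
    """Return a (head_dim, rank) basis with orthonormal columns.

    Uses the top right singular vectors of X (tokens x head_dim),
    i.e. an SVD-style fit to the cached matrix itself.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T


rng = np.random.default_rng(0)
n_tokens, head_dim, rank = 128, 64, 16

# Synthetic per-head keys that are approximately rank-16 (assumption).
K = rng.standard_normal((n_tokens, rank)) @ rng.standard_normal((rank, head_dim))
K += 0.01 * rng.standard_normal((n_tokens, head_dim))

P = orthonormal_basis(K, rank)
K_cached = K @ P              # only this (n_tokens, rank) matrix is stored
K_restored = K_cached @ P.T   # approximate reconstruction in the full head_dim

ortho_error = np.linalg.norm(P.T @ P - np.eye(rank))
rel_error = np.linalg.norm(K - K_restored) / np.linalg.norm(K)
print(f"stored fraction: {rank / head_dim:.2f}, relative error: {rel_error:.4f}")
```

The sketch measures error on the key matrix alone. The abstract's point is that such a proxy may not track the error that survives softmax, value mixing, and the rest of the decoder layer, which is the quantity StiefAttention is described as minimizing.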
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Parallel Computing and Optimization Techniques
