Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro

TL;DR
This paper introduces a stochastic training method that enables flexible depth-wise cache sharing in transformer models, reducing memory usage without sacrificing performance.
Contribution
It proposes a simple training approach, random cross-layer attention, to make models robust to depth-wise cache sharing, facilitating efficient KV cache reduction.
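As a rough illustration of what randomizing cross-layer attention during training could look like, the sketch below samples, for each layer, which layer's keys and values it reads. The summary above does not specify the sampling scheme, so everything here is an assumption made for illustration: the function name `sample_kv_sources`, the `share_prob` parameter, and the choice to draw uniformly from earlier layers are all hypothetical and are not taken from the paper.

```python
import random


def sample_kv_sources(num_layers, share_prob, rng):
    """Pick, for each layer, the layer whose keys/values it attends to.

    Hypothetical scheme (an assumption, not the paper's): with probability
    `share_prob` a layer reads the KVs of a uniformly chosen earlier layer;
    otherwise it reads its own. Layer 0 has no earlier layer, so it always
    uses its own KVs.
    """
    sources = []
    for layer in range(num_layers):
        if layer > 0 and rng.random() < share_prob:
            sources.append(rng.randrange(layer))  # some earlier layer
        else:
            sources.append(layer)  # the layer's own keys/values
    return sources


# Draw a fresh routing for one training step of an 8-layer model.
rng = random.Random(0)
routing = sample_kv_sources(num_layers=8, share_prob=0.5, rng=rng)
print(routing)
```

Under this assumed scheme, a model trained with a new routing at every step sees many different sharing patterns, which is one plausible way to make it tolerant of whichever fixed pattern is chosen at serving time.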
Findings
Depth-wise cache sharing can significantly reduce memory footprint.
Random cross-layer attention maintains or improves model performance.
The method is effective during pre-training and fine-tuning across various models.
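To make the memory claim concrete, the standard back-of-the-envelope KV cache size is 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The sketch below applies that formula; the model dimensions are illustrative assumptions, not figures from the paper, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(cached_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 counts one tensor for keys and one for values.
    `cached_layers` is the number of layers that actually store a cache;
    layers that reuse another layer's cache add nothing.
    """
    return (2 * cached_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative configuration (assumed): 32 layers, 32 KV heads of
# dimension 128, a 4096-token context, fp16 storage (2 bytes per element).
full = kv_cache_bytes(cached_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
shared = kv_cache_bytes(cached_layers=16, num_kv_heads=32, head_dim=128, seq_len=4096)

print(full / 2**30)    # size in GiB when every layer keeps its own cache
print(shared / 2**30)  # size in GiB when only half the layers keep a cache
```

Because the formula is linear in the number of cached layers, dropping the cache for a fraction of the layers reduces the footprint by that same fraction, independently of any compression or eviction applied along the sequence axis.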
Abstract
Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers efficient optimization without information loss. We propose a simple training…
