The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang

TL;DR
This paper demonstrates that the key-value cache in transformer inference is redundant, as the residual stream alone contains all necessary information, enabling a new memory-efficient inference method called KV-Direct.
Contribution
The authors prove the redundancy of the KV cache, show residuals can fully replace it, and introduce KV-Direct, a memory-efficient inference scheme that recomputes keys and values from residuals.
Findings
Recomputing keys and values from residuals yields bit-identical outputs.
KV-Direct reduces peak memory usage significantly compared to standard caching.
KV-Direct maintains perfect token accuracy across various cache budgets.
Abstract
The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested.…
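The central claim, that keys and values are deterministic projections of the residual stream, can be illustrated with a toy single-layer sketch. This is not the authors' KV-Direct implementation: the dimensions, the RMS normalization, the random weights, and the names `rms_norm` and `project_kv` are all assumptions made for illustration. The sketch stores keys and values the usual way, then stores only the residual vectors and recomputes the projections, and checks that the two agree bit for bit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 8, 5  # toy sizes, chosen arbitrarily

# Hypothetical key/value projection weights for one attention layer.
W_k = rng.standard_normal((d_model, d_head)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_head)).astype(np.float32)


def rms_norm(h, eps=1e-6):
    """Pre-attention normalization (an assumed RMSNorm without a learned scale)."""
    return h / np.sqrt(np.mean(h * h) + eps)


def project_kv(h):
    """Map one residual vector to its key and value.

    Written as an explicit multiply-and-sum so the summation order is
    fixed and does not depend on a BLAS backend.
    """
    x = rms_norm(h)
    k = (x[:, None] * W_k).sum(axis=0)
    v = (x[:, None] * W_v).sum(axis=0)
    return k, v


# Stand-in residual stream: one vector per token at this layer's input.
residuals = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# Standard inference: compute keys and values once and cache them.
kv_cache = [project_kv(h) for h in residuals]

# Residual-only variant: keep the residual vectors, recompute K and V on demand.
stored_residuals = residuals.copy()
recomputed = [project_kv(h) for h in stored_residuals]

# The same function applied to the same inputs gives the same bits.
bit_identical = all(
    np.array_equal(k_c, k_r) and np.array_equal(v_c, v_r)
    for (k_c, v_c), (k_r, v_r) in zip(kv_cache, recomputed)
)
print("bit-identical:", bit_identical)
```

The comparison uses `np.array_equal` rather than `np.allclose` on purpose: the claim is exact equality, not agreement within a tolerance. In a real model the same argument relies on the recomputation running the same kernels in the same order as the original forward pass.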
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management
