Training-Free Exponential Context Extension via Cascading KV Cache
Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang

TL;DR
This paper introduces a training-free cascading KV cache that extends a transformer's effective context window. It keeps relevant tokens longer without increasing the cache size, enabling long-sequence processing with lower prefill latency.
Contribution
The paper proposes a cascading sub-cache buffer system that selectively retains important tokens. Compared with linear caching methods, it maintains context better and reduces prefill latency.
Findings
Outperforms linear caching in key benchmarks
Retains better retrieval accuracy at 1 million tokens
Reduces prefill latency by a factor of 6.8
Abstract
The transformer's context window is vital for tasks such as few-shot learning and conditional generation, as it preserves previous tokens for active memory. However, as context length increases, the computational cost grows quadratically, hindering the deployment of large language models (LLMs) in real-world, long-sequence scenarios. Although some recent key-value caching (KV cache) methods offer linear inference complexity, they naively manage the stored context, prematurely evicting tokens and losing valuable information. Moreover, they lack an optimized prefill/prompt stage strategy, resulting in higher latency than even quadratic attention for realistic context sizes. In response, we introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens, enabling the model to maintain longer context histories without increasing the…
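To make the idea of cascading sub-cache buffers concrete, here is a minimal sketch in Python. It is not the authors' implementation. The class name `CascadingCache`, its parameters, and the acceptance rule (each sub-cache accepts one of every `accept_base**i` tokens offered to it, and rejected tokens are dropped) are assumptions chosen for illustration. The sketch uses a fixed acceptance rate in place of the relevance-based selection the abstract describes, and it stores token positions instead of key/value tensors.

```python
from collections import deque


class CascadingCache:
    """Illustrative cascade of fixed-size FIFO sub-caches (hypothetical design)."""

    def __init__(self, num_sub_caches=4, sub_cache_size=4, accept_base=2):
        self.size = sub_cache_size
        self.buffers = [deque() for _ in range(num_sub_caches)]
        # Assumption: sub-cache i accepts one of every accept_base**i tokens
        # offered to it, so deeper sub-caches fill more slowly.
        self.strides = [accept_base ** i for i in range(num_sub_caches)]
        self.offered = [0] * num_sub_caches

    def add(self, token):
        """Insert a token; evictions cascade into later sub-caches."""
        level = 0
        while token is not None and level < len(self.buffers):
            count = self.offered[level]
            self.offered[level] += 1
            if count % self.strides[level] != 0:
                return  # rejected by this sub-cache: the token is dropped
            buf = self.buffers[level]
            buf.append(token)
            # When the sub-cache overflows, its oldest token is offered onward.
            token = buf.popleft() if len(buf) > self.size else None
            level += 1
        # A token evicted from the last sub-cache is dropped.

    def positions(self):
        """Return retained tokens, oldest sub-cache first."""
        return [t for buf in reversed(self.buffers) for t in buf]


cache = CascadingCache()
for position in range(100):
    cache.add(position)
print(cache.positions())
```

With 16 slots in total, a plain sliding window over 100 tokens would keep only positions 84 to 99. In this sketch the first sub-cache still holds the four most recent tokens, while the later sub-caches hold progressively sparser and older positions, so the retained history spans further back at the same memory budget.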
Taxonomy
Topics: Cloud Computing and Remote Desktop Technologies · Distributed and Parallel Computing Systems
Methods: Softmax · Attention Is All You Need
