Training-Free Exponential Context Extension via Cascading KV Cache

Jeffrey Willette; Heejun Lee; Youngwan Lee; Myeongjae Jeon; Sung Ju Hwang

arXiv:2406.17808 · cs.CL · April 1, 2025

TL;DR

This paper introduces a cascading KV cache that extends the effective context window of transformers without increasing the cache size. Cascading sub-cache buffers retain the most relevant tokens, enabling long-sequence processing with lower latency.

Contribution

It proposes a cascading sub-cache buffer system that selectively retains important tokens. The system preserves context better than linear-complexity KV caching methods and reduces prefill latency.
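The cascading idea can be illustrated with a toy sketch. This is not the authors' implementation: the class name `CascadingCache`, the rule that each later buffer accepts every second token offered to it, and the use of plain token positions in place of key-value tensors are all assumptions made for illustration. The relevance-based token selection that the paper describes is omitted here.

```python
from collections import deque


class CascadingCache:
    """Toy cascade of fixed-size FIFO buffers holding token positions.

    Hypothetical sketch: acceptance is purely positional, not score-based.
    """

    def __init__(self, num_buffers: int, buffer_size: int):
        self.buffers = [deque() for _ in range(num_buffers)]
        self.buffer_size = buffer_size
        self.offered = [0] * num_buffers  # tokens offered to each buffer so far

    def add(self, token: int, level: int = 0) -> None:
        if level == len(self.buffers):
            return  # evicted from the last buffer: dropped for good
        self.offered[level] += 1
        # Assumed rule: buffer 0 accepts every token; each later buffer
        # accepts only every second token offered to it and drops the rest.
        if level > 0 and self.offered[level] % 2 == 0:
            return
        buf = self.buffers[level]
        buf.append(token)
        if len(buf) > self.buffer_size:
            # Buffer is over capacity: cascade its oldest token downward.
            self.add(buf.popleft(), level + 1)

    def tokens(self) -> list[int]:
        """Retained token positions, oldest first."""
        return [t for buf in reversed(self.buffers) for t in buf]


cache = CascadingCache(num_buffers=3, buffer_size=4)
for position in range(64):
    cache.add(position)
print(cache.tokens())
```

After 64 tokens, the 12 slots in this sketch hold positions 36 through 63 at decreasing density, a span of 28 positions. A plain sliding window of the same size would cover only the last 12. Under the assumed halving rule, each extra buffer roughly doubles the covered span while the total cache size grows only linearly.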

Findings

01. Outperforms linear caching in key benchmarks
02. Retains better retrieval accuracy at 1 million tokens
03. Reduces prefill latency by a factor of 6.8

Abstract

The transformer's context window is vital for tasks such as few-shot learning and conditional generation as it preserves previous tokens for active memory. However, as the context lengths increase, the computational costs grow quadratically, hindering the deployment of large language models (LLMs) in real-world, long sequence scenarios. Although some recent key-value caching (KV Cache) methods offer linear inference complexity, they naively manage the stored context, prematurely evicting tokens and losing valuable information. Moreover, they lack an optimized prefill/prompt stage strategy, resulting in higher latency than even quadratic attention for realistic context sizes. In response, we introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens, enabling the model to maintain longer context histories without increasing the…
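The quadratic-versus-linear contrast drawn in the abstract can be made concrete by counting query-key scores. The sketch below is a back-of-the-envelope count, not a latency model. The function names are hypothetical, and the example sizes (1,000,000 tokens, a 65,536-token cache) are chosen only for illustration.

```python
def full_attention_scores(n: int) -> int:
    """Query-key scores computed by causal full attention over n tokens."""
    # Token i attends to itself and all i - 1 earlier tokens.
    return n * (n + 1) // 2


def windowed_attention_scores(n: int, window: int) -> int:
    """Scores when each token attends to at most `window` cached tokens."""
    if n <= window:
        return full_attention_scores(n)
    # The first `window` tokens behave like full attention; every later
    # token attends to exactly `window` cached entries.
    return full_attention_scores(window) + (n - window) * window


n, window = 1_000_000, 65_536  # illustrative sizes only
ratio = full_attention_scores(n) / windowed_attention_scores(n, window)
print(f"full attention computes about {ratio:.1f}x more scores")
```

The first count grows with the square of the sequence length, while the second grows linearly once the sequence exceeds the cache size. Fixed-size KV caches are attractive for this reason, even though a fixed window discards older tokens.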

Code & Models

Repositories

jeffwillette/cascading_kv_cache (PyTorch, official)

Taxonomy

Topics: Cloud Computing and Remote Desktop Technologies · Distributed and Parallel Computing Systems

Methods: Softmax · Attention Is All You Need