KV Cache Offloading for Context-Intensive Tasks
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

TL;DR
This paper investigates KV-cache offloading for long-context tasks in large language models, shows that existing offloading methods lose accuracy on context-intensive tasks, and proposes a simple strategy that recovers much of that accuracy.
Contribution
It introduces the Text2JSON benchmark for evaluating KV offloading on context-heavy tasks and proposes a straightforward method that enhances accuracy across multiple models.
Findings
Modern KV offloading methods show significant performance degradation on Text2JSON and other context-intensive tasks.
The paper identifies two causes of the poor accuracy: low-rank projection and unreliable landmarks.
It proposes a simple alternative strategy that improves accuracy across multiple models.
Abstract
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3…
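To make the memory bottleneck mentioned in the abstract concrete, here is a rough back-of-the-envelope sketch of KV-cache size for a standard decoder-only transformer. The helper `kv_cache_bytes` and the configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit storage) are illustrative assumptions typical of an 8B-class model with grouped-query attention; they are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer. All arguments are hypothetical configuration values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Assumed configuration (illustrative, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
full_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=131072)

print(f"{per_token / 1024:.0f} KiB per token")          # 128 KiB per token
print(f"{full_context / 2**30:.0f} GiB at 131072 tokens")  # 16 GiB at 131072 tokens
```

Under these assumptions a single 128K-token sequence needs about 16 GiB of cache on top of the model weights, which is the kind of footprint that motivates moving part of the cache off the accelerator.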
