KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov; Ivan Ermakov; Denis Kuznedelev; Vyacheslav Zhdanovskiy; Yegor Yershov

arXiv:2604.08426·cs.LG·May 18, 2026


TL;DR

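This paper studies KV-cache offloading for large language models on context-intensive tasks, where the answer depends on looking up large amounts of information from the prompt. It finds significant accuracy degradation with current offloading methods and proposes a simple alternative strategy that improves accuracy in these scenarios.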

Contribution

It introduces the Text2JSON benchmark for evaluating KV offloading on context-heavy tasks and proposes a straightforward method that enhances accuracy across multiple models.

Findings

01

KV offloading causes significant performance degradation on Text2JSON and other context-intensive tasks, on both Llama 3 and Qwen 3 models.

02

Identified low-rank projection and unreliable landmarks as causes of poor accuracy.

03

Proposed a simple alternative strategy that improves accuracy across models.

Abstract

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3…
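To make the memory bottleneck mentioned in the abstract concrete, the sketch below estimates the size of a KV cache using the standard accounting: one key tensor and one value tensor per layer, per token. The configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the function name `kv_cache_bytes` are illustrative assumptions, not figures or code from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate the bytes needed to hold keys and values for all layers and tokens."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


if __name__ == "__main__":
    # Hypothetical model configuration, chosen only for illustration.
    layers, kv_heads, head_dim = 32, 8, 128

    per_token = kv_cache_bytes(layers, kv_heads, head_dim, seq_len=1)
    full_context = kv_cache_bytes(layers, kv_heads, head_dim, seq_len=131072)

    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{full_context / 1024**3:.1f} GiB at 131072 tokens")
```

Under these assumptions the cache grows linearly with context length, reaching 16 GiB for a single 131072-token sequence, which is the kind of footprint that motivates moving part of the cache off the accelerator.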
