Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

Adithya Bhaskar; Alexander Wettig; Tianyu Gao; Yihe Dong; Danqi Chen

arXiv:2506.17121 · cs.CL · June 23, 2025

TL;DR

This paper introduces the KV footprint metric to evaluate and optimize key-value cache memory in long-context language models, proposing new eviction strategies and a learned method to reduce memory use without sacrificing performance.

Contribution

It proposes the KV footprint metric, adapts eviction methods for pre-filling, and introduces PruLong, a learned approach to minimize memory while maintaining long-context understanding.

Findings

01. PruLong reduces KV footprint by 12% compared to prior methods.

02. Adapting eviction methods for pre-filling lowers peak memory usage.

03. KV footprint metric effectively captures memory efficiency in long-context models.

Abstract

Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the *KV footprint* as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods -- *post-fill eviction* -- has a high footprint due to being…
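
To make the footprint idea concrete, here is a toy sketch in Python. It assumes one possible reading of the abstract's description: the footprint is the number of cached KV entries summed over time steps, normalized by what a full, never-evicted cache would hold. The function names (`kv_footprint`, `post_fill_trace`, `capped_trace`), the one-token-per-step timeline, and the fixed `budget` are illustrative assumptions, not the paper's definitions and not the API of `princeton-pli/prulong`.

```python
from typing import List, Sequence


def kv_footprint(cache_sizes: Sequence[int]) -> float:
    """Toy footprint: cached entries summed over steps, relative to a full cache.

    Assumption: one token is processed per step, so a full (never evicted)
    cache holds t entries at step t, for t = 1..T.
    """
    if not cache_sizes:
        raise ValueError("cache_sizes must contain at least one step")
    steps = len(cache_sizes)
    full_total = steps * (steps + 1) // 2  # 1 + 2 + ... + T
    return sum(cache_sizes) / full_total


def full_trace(total_len: int) -> List[int]:
    """Cache size per step when nothing is ever evicted."""
    return list(range(1, total_len + 1))


def post_fill_trace(prompt_len: int, gen_len: int, budget: int) -> List[int]:
    """Hypothetical post-fill eviction: keep every prompt KV during pre-fill,
    then shrink to `budget` entries and hold that size while decoding."""
    prefill = list(range(1, prompt_len + 1))
    decode = [min(budget, prompt_len)] * gen_len
    return prefill + decode


def capped_trace(prompt_len: int, gen_len: int, budget: int) -> List[int]:
    """Hypothetical eviction applied during pre-fill as well: the cache never
    grows past `budget` entries at any step."""
    return [min(t, budget) for t in range(1, prompt_len + gen_len + 1)]


if __name__ == "__main__":
    post = post_fill_trace(prompt_len=8, gen_len=2, budget=4)
    capped = capped_trace(prompt_len=8, gen_len=2, budget=4)
    for name, trace in [("post-fill", post), ("capped", capped)]:
        print(f"{name}: peak={max(trace)} footprint={kv_footprint(trace):.3f}")
```

In this toy setting both strategies end with the same cache size, but the post-fill trace still peaks at the full prompt length, so its footprint stays close to that of a full cache. A metric that only reported the final cache size would hide that difference.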

Code & Models

Repositories

princeton-pli/prulong (PyTorch, official)

Taxonomy

Topics: Personal Information Management and User Behavior · Topic Modeling · Information Retrieval and Search Behavior