RefreshKV: Updating Small KV Cache During Long-form Generation

Fangyuan Xu; Tanya Goyal; Eunsol Choi

arXiv:2411.05787 · cs.CL · March 4, 2025

TL;DR

RefreshKV is a novel inference method for long-form generation with LLMs that dynamically updates a small KV cache, balancing speed and performance by alternating between full and partial attention.

Contribution

It introduces a flexible approach to update the KV cache during generation, improving long-form output quality while maintaining speed.

Findings

01 Achieves comparable speedup to eviction-based methods.

02 Improves long-form generation performance.

03 Continued pretraining with RefreshKV enhances results.

Abstract

Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to…
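The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses a single attention head in NumPy, a fixed refresh interval (`stride`), and a top-`budget` selection rule, all of which are assumptions made for illustration. The function names are hypothetical, and the sketch attends only over the input tokens, omitting the keys and values of newly generated tokens.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    weights = softmax(keys @ q / np.sqrt(q.shape[0]))
    return weights @ values, weights

def refresh_small_cache(weights, budget):
    """Indices of the `budget` highest-attention positions, in original order.

    Top-k selection is an assumed rule for this sketch.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    budget = min(budget, weights.shape[0])
    top = np.argpartition(weights, -budget)[-budget:]
    return np.sort(top)

def decode_with_refresh(queries, keys, values, budget, stride):
    """Full attention every `stride` steps; small-cache attention otherwise.

    After each full-attention step the small cache is rebuilt from that
    step's attention weights over the entire input.
    """
    small_idx = np.arange(min(budget, keys.shape[0]))  # placeholder before first refresh
    outputs, used_full = [], []
    for step, q in enumerate(queries):
        if step % stride == 0:
            out, weights = attend(q, keys, values)          # full-context step
            small_idx = refresh_small_cache(weights, budget)
            used_full.append(True)
        else:
            out, _ = attend(q, keys[small_idx], values[small_idx])  # partial step
            used_full.append(False)
        outputs.append(out)
    return np.stack(outputs), used_full, small_idx
```

In this sketch, a smaller `stride` refreshes the small cache more often, at the cost of more full-attention steps, while a larger `budget` keeps more input tokens available between refreshes.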

Code & Models

Repositories

carriex/recycled-attention (Official)

Videos

RefreshKV: Updating Small KV Cache During Long-form Generation · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need