RefreshKV: Updating Small KV Cache During Long-form Generation
Fangyuan Xu, Tanya Goyal, Eunsol Choi

TL;DR
RefreshKV is a novel inference method for long-form generation with LLMs that dynamically updates a small KV cache, balancing speed and performance by alternating between full and partial attention.
Contribution
RefreshKV introduces a flexible way to update the small KV cache during generation, improving long-form output quality while keeping a speedup comparable to eviction-based methods.
Findings
Achieves comparable speedup to eviction-based methods.
Improves long-form generation performance.
Continued pretraining with RefreshKV enhances results.
Abstract
Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to…
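The alternation described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. It assumes a single attention head, a fixed refresh interval (`refresh_every`), and top-k selection by attention weight to rebuild the small cache. All function and parameter names are hypothetical. For brevity it attends only over a fixed set of input keys and values and does not append the KV entries of newly generated tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights


def top_k_indices(weights, k):
    """Indices of the k largest attention weights, returned in positional order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


def generate(queries, keys, values, small_size, refresh_every):
    """Alternate full-attention steps with steps over a small cache.

    Assumption: a full-attention step happens every `refresh_every` steps
    (starting at step 0), and the small cache is rebuilt from the top-k
    attention weights of that step.
    """
    small = []  # indices of input tokens currently kept in the small cache
    outputs, log = [], []
    for step, q in enumerate(queries):
        if step % refresh_every == 0:
            # Full attention over the entire input, then refresh the small cache.
            out, weights = attend(q, keys, values)
            small = top_k_indices(weights, small_size)
            log.append(("full", list(small)))
        else:
            # Cheaper step: attend only over the tokens kept in the small cache.
            out, _ = attend(q, [keys[i] for i in small], [values[i] for i in small])
            log.append(("partial", list(small)))
        outputs.append(out)
    return outputs, log


# Toy usage with random vectors standing in for real keys, values, and queries.
random.seed(0)
dim, n_ctx, n_steps = 4, 16, 6
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_ctx)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_ctx)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_steps)]

outputs, log = generate(queries, keys, values, small_size=4, refresh_every=3)
for mode, kept in log:
    print(mode, kept)
```

In this sketch, the partial steps reuse whichever tokens the most recent full-attention step ranked highest, so a token dropped at one refresh can re-enter the small cache at the next. That is the contrast the abstract draws with one-time compression.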
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
