KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez

TL;DR
KV-Fold is a training-free inference protocol that lets transformers process long contexts by treating the KV cache as a recurrent accumulator carried across chunks, giving stable, exact long-distance retrieval.
Contribution
The paper introduces KV-Fold, a simple, stable, training-free recurrence for long-context inference that reuses the KV cache across chunks without modifying or retraining the model.
Findings
Achieves 100% exact-match retrieval on long contexts up to 128K tokens.
Maintains stability and accuracy across different chunk sizes and model families.
Operates within a 40GB GPU memory limit, enabling practical long-context inference.
Abstract
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates…
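The left-fold structure described in the abstract can be sketched with a toy example. This is a minimal illustration, not the authors' implementation: it assumes a single attention head with identity query/key/value projections, no position encoding, and a hypothetical head dimension `D`. The names `attend`, `step`, and `kv_fold` are invented for this sketch. Because the toy has one layer and no positions, chunking does not change its outputs, so it shows only the shape of the recurrence and not the drift behavior the abstract mentions.

```python
import math
import random
from functools import reduce

D = 4  # toy head dimension (assumption, not from the paper)


def attend(q, keys, values):
    """Single-head softmax attention of one query over the cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(D)]


def step(state, chunk):
    """One fold step: process `chunk` conditioned on the carried cache,
    append the new keys/values, and pass the enlarged cache forward."""
    keys, values, outputs = (list(part) for part in state)
    for x in chunk:
        # Toy projections: q = k = v = x (assumption for illustration).
        keys.append(x)
        values.append(x)
        outputs.append(attend(x, keys, values))
    return keys, values, outputs


def kv_fold(tokens, chunk_size):
    """Left fold of `step` over fixed-size chunks, starting from an empty cache."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return reduce(step, chunks, ([], [], []))


if __name__ == "__main__":
    random.seed(0)
    tokens = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(10)]
    keys, values, outputs = kv_fold(tokens, chunk_size=3)
    print(len(keys), len(outputs))
```

Here `reduce` plays the role of `foldl`: the accumulator is the cache, and the same `step` is applied to every chunk. In a real transformer, `step` would be a forward pass that receives the carried cache as a prefix.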
