KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang; Bairu Hou; Wei Wei; Yujia Bao; Shiyu Chang

arXiv:2502.16002 · cs.CL · November 11, 2025

TL;DR

KVLink precomputes the key-value (KV) cache of each document independently and reuses it whenever that document reappears in a query's context, so the model does not re-encode overlapping context. It reduces time-to-first-token by up to 96% and improves question answering accuracy by 4% on average.

Contribution

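This paper introduces KVLink, an approach that precomputes the KV cache of each document independently and concatenates the cached representations at inference to eliminate redundant context encoding, together with techniques that mitigate the performance degradation caused by encoding documents in isolation.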

Findings

01 Improves question answering accuracy by 4% on average.

02 Reduces time-to-first-token by up to 96%.

03 Outperforms state-of-the-art methods with cache reuse.

Abstract

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after…
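
The positional adjustment mentioned in the abstract can be illustrated with a minimal sketch. It assumes the model uses rotary position embeddings (RoPE), which the text above does not state, and every name in it (`rope_rotate`, `precompute_doc_keys`, `link_caches`) is hypothetical rather than taken from the KVLink code. Each document's keys are cached at local positions starting from 0, and at inference each cache is rotated by its offset in the concatenated context so that the keys land on their global positions.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply a rotary position embedding for position `pos` to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each consecutive pair (i, i + 1) is rotated by its own frequency.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def precompute_doc_keys(raw_keys):
    """Offline step (assumed): encode one document alone, at local positions 0..n-1."""
    return [rope_rotate(k, p) for p, k in enumerate(raw_keys)]

def link_caches(doc_caches):
    """Inference step (assumed): concatenate caches, shifting each to its global offset."""
    linked, offset = [], 0
    for cache in doc_caches:
        # Rotations compose additively, so rotating by `offset` moves a key
        # from local position p to global position offset + p.
        linked.extend(rope_rotate(k, offset) for k in cache)
        offset += len(cache)
    return linked

# Toy "pre-rotation" keys for two documents (2 and 3 tokens, head dimension 4).
doc_a = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -1.0, 0.4]]
doc_b = [[0.7, -0.1, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0], [-0.3, 0.9, 0.6, 0.2]]
linked_keys = link_caches([precompute_doc_keys(doc_a), precompute_doc_keys(doc_b)])
```

Under this assumption the re-rotated keys match, up to floating-point error, the keys obtained by applying RoPE to the concatenated sequence directly. The sketch covers only the positional part: values carry no rotary position in RoPE and are left out, and nothing here addresses the fact that independently encoded documents never attended to one another.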

Code & Models

Repositories

UCSB-NLP-Chang/KVLink (PyTorch, official)

Videos

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse · SlidesLive

Taxonomy

Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques