KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

TL;DR
KVLink is a method that precomputes and reuses key-value caches in large language models to significantly reduce inference time and improve accuracy when handling overlapping contexts, enabling scalable and efficient LLM deployment.
Contribution
This paper introduces KVLink, a novel approach for precomputing and concatenating KV caches to eliminate redundant computation in LLMs, with techniques to maintain performance and attention integrity.
Findings
Improves question answering accuracy by 4% on average.
Reduces time-to-first-token by up to 96%.
Outperforms state-of-the-art cache-reuse methods.
Abstract
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after…
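The position-adjustment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes rotary position embeddings (RoPE), it stores keys as plain Python lists, and the names `rope_rotate`, `precompute_doc_cache`, and `link_caches` are hypothetical. It covers only the position adjustment; the second technique named in the abstract is cut off in this summary and is not modeled here. The sketch relies on the fact that RoPE rotations compose, so a key cached at local position p can be moved to global position p + offset by one further rotation.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of an even-length vector by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def precompute_doc_cache(keys, values):
    """Hypothetical offline step: encode one document at local positions 0..n-1."""
    return {
        "keys": [rope_rotate(k, p) for p, k in enumerate(keys)],
        "values": [list(v) for v in values],
    }


def link_caches(caches):
    """Concatenate per-document caches, shifting keys to their global positions."""
    keys, values = [], []
    offset = 0
    for cache in caches:
        # One extra rotation by `offset` turns local position p into p + offset.
        keys.extend(rope_rotate(k, offset) for k in cache["keys"])
        # Values carry no positional rotation in this sketch, so they are reused as-is.
        values.extend(cache["values"])
        offset += len(cache["keys"])
    return {"keys": keys, "values": values}


# Toy usage: two documents cached independently, then linked for one query.
doc_a = precompute_doc_cache([[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.3]], [[1.0], [2.0]])
doc_b = precompute_doc_cache([[0.3, 0.7, 1.0, -1.0]], [[3.0]])
linked = link_caches([doc_a, doc_b])
```

In this toy setup, the single key of `doc_b` ends up rotated as if it had been encoded at global position 2, without re-encoding the document. A real model would apply the same shift per attention head and per layer.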
Taxonomy
Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques
