Sparse Attention across Multiple-context KV Cache

Ziyi Cao; Qingyi Si; Jingbin Zhang; Bingquan Liu

arXiv:2508.11661 · cs.LG · August 19, 2025

Open Access

TL;DR

This paper introduces SamKV, a method for sparsifying attention over multiple-context KV caches. In retrieval-augmented generation it reduces sequence length to 15% of the original, cutting memory and computation while keeping accuracy comparable to full-recompute methods.

Contribution

SamKV is the first approach to apply attention sparsification to multiple-context KV caches, effectively compressing sequence length and improving efficiency in RAG scenarios.

Findings

01 Reduces sequence length to 15% of original

02 Maintains accuracy comparable to full-recompute methods

03 Boosts throughput in multi-context retrieval scenarios

Abstract

Large language models face significant cost challenges in long-sequence inference. To address this, reusing the historical Key-Value (KV) Cache to improve inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache entries, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where the historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where the documents retrieved as context are unknown beforehand, each document's KV Cache is computed and stored independently (termed multiple-context KV Cache) and therefore lacks cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate accuracy loss from…
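The setting the abstract describes can be illustrated with a toy sketch: several documents each hold a KV cache computed independently, and only the highest-scoring entries are kept before attention runs. This is not SamKV's algorithm, which this page does not describe; it is plain top-k selection by scaled dot-product score, written in pure Python. The names `select_top_kv`, `attend`, and `keep_ratio` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_kv(query, contexts, keep_ratio):
    """Return (doc_index, position) pairs for the highest-scoring cache entries.

    contexts is a list of per-document caches; each cache is a list of
    (key, value) pairs computed without seeing the other documents.
    Assumption: entries are ranked by scaled dot-product score against a
    single query vector, which is a generic heuristic, not the paper's method.
    """
    scale = math.sqrt(len(query))
    scored = []
    for doc_index, cache in enumerate(contexts):
        for position, (key, _value) in enumerate(cache):
            scored.append((dot(query, key) / scale, doc_index, position))
    # Keep at least one entry; keep_ratio=0.15 mirrors the 15% figure above.
    budget = max(1, round(keep_ratio * len(scored)))
    kept = sorted(scored, key=lambda item: item[0], reverse=True)[:budget]
    # Restore document order so the shortened sequence stays ordered.
    return sorted((doc_index, position) for _score, doc_index, position in kept)


def attend(query, contexts, positions):
    """Softmax attention restricted to the selected cache entries."""
    scale = math.sqrt(len(query))
    keys = [contexts[d][p][0] for d, p in positions]
    values = [contexts[d][p][1] for d, p in positions]
    logits = [dot(query, key) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    width = len(values[0])
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(width)
    ]
```

With two cached documents of ten entries each and `keep_ratio=0.15`, `select_top_kv` keeps three of the twenty entries, and `attend` then runs over that shortened sequence only. Because each document's keys were computed in isolation, a selection rule like this one sees no cross-document interaction, which is the gap the abstract points to.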

Taxonomy

Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Age of Information Optimization