Sparse Attention across Multiple-context KV Cache
Ziyi Cao, Qingyi Si, Jingbin Zhang, Bingquan Liu

TL;DR
This paper introduces SamKV, a novel method for sparsifying attention in multiple-context KV caches, significantly reducing memory and computation in retrieval-augmented generation without sacrificing accuracy.
Contribution
SamKV is the first approach to apply attention sparsification to multiple-context KV caches, effectively compressing sequence length and improving efficiency in RAG scenarios.
Findings
Reduces sequence length to 15% of the original
Maintains accuracy comparable to full-recompute methods
Boosts throughput in multi-context retrieval scenarios
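To give a rough sense of what the 15% figure means for memory, the sketch below applies the standard KV cache size formula (keys plus values, per layer, per KV head). The model dimensions and the 32,000-token context are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a KV cache in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and context length, for illustration only.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
full_len = 32_000
kept_len = full_len * 15 // 100  # sequence length reduced to 15% of the original

full_bytes = kv_cache_bytes(full_len, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
kept_bytes = kv_cache_bytes(kept_len, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"full cache: {full_bytes / 2**20:.0f} MiB")
print(f"kept cache: {kept_bytes / 2**20:.0f} MiB")
```

Because the cache size is linear in sequence length, the attended cache shrinks by the same factor as the sequence, whatever the model shape.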
Abstract
Large language models face significant cost challenges in long-sequence inference. To address this, reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where the documents retrieved as context are unknown beforehand, each document's KV Cache is computed and stored independently (termed multiple-context KV Cache) and therefore lacks cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate accuracy loss from…
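The multiple-context setting described in the abstract can be illustrated with a minimal sketch. It assumes a generic top-k selection by query-key score applied to each document's independently stored cache. This is not SamKV's actual selection procedure, which the truncated abstract does not specify. The function names, the single-head layout, and the `keep_ratio` parameter are all hypothetical.

```python
import numpy as np


def select_top_k(query, keys, values, k):
    """Keep the k cached tokens whose keys score highest against the query."""
    scores = keys @ query / np.sqrt(query.shape[0])  # shape: (n_tokens,)
    k = min(k, keys.shape[0])
    # argpartition finds the k largest scores; sorting restores token order.
    top = np.sort(np.argpartition(scores, -k)[-k:])
    return keys[top], values[top]


def sparse_attention(query, doc_caches, keep_ratio=0.15):
    """Attend over a sparsified concatenation of per-document KV caches.

    doc_caches is a list of (keys, values) pairs, one per retrieved document,
    each computed independently (no cross-attention between documents).
    """
    kept_k, kept_v = [], []
    for keys, values in doc_caches:
        k = max(1, int(round(keep_ratio * keys.shape[0])))
        ks, vs = select_top_k(query, keys, values, k)
        kept_k.append(ks)
        kept_v.append(vs)
    K = np.concatenate(kept_k)
    V = np.concatenate(kept_v)
    scores = K @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, K.shape[0]


# Hypothetical example: three documents of different lengths, head dim 64.
rng = np.random.default_rng(0)
d = 64
doc_caches = [
    (rng.standard_normal((n, d)), rng.standard_normal((n, d)))
    for n in (100, 200, 300)
]
query = rng.standard_normal(d)
output, kept_tokens = sparse_attention(query, doc_caches)
print(output.shape, kept_tokens)
```

Selecting per document, rather than over the concatenated cache, keeps every retrieved document represented in the sparsified sequence; whether that matches the paper's design is an assumption of this sketch.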
Taxonomy
Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Age of Information Optimization
