Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Mohsen Dehghankar, Abolfazl Asudeh

TL;DR
This paper introduces Louver, an index structure for sparse attention in large language models. By reformulating key retrieval as a range searching problem, Louver guarantees zero false negatives while improving inference efficiency.
Contribution
Louver is a lightweight, hardware-aware index that guarantees full recall of relevant keys, outperforming prior methods in accuracy and runtime for KV cache retrieval.
Findings
Louver guarantees zero false negatives in key retrieval.
Louver outperforms prior sparse attention methods in accuracy and speed.
Louver is faster than optimized dense attention methods like FlashAttention.
Abstract
Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys…
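To make the notion of a false negative concrete, here is a minimal sketch in plain Python. It is not Louver's algorithm. It assumes, for illustration only, that a key counts as relevant when its dot product with the query is at least a threshold `tau`, so that retrieval becomes a halfspace range query, and it answers that query with a brute-force scan where a real index would go. All function names and the toy data are hypothetical. The sketch contrasts this with a fixed token budget (top-k), which can drop a relevant key.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def range_select(query, keys, tau):
    """Halfspace range query: indices of all keys with query.key >= tau.
    Brute-force scan used as a stand-in for an index (assumption)."""
    return {i for i, k in enumerate(keys) if dot(query, k) >= tau}

def topk_select(query, keys, budget):
    """Fixed-budget selection: keep only the `budget` highest-scoring keys."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return set(order[:budget])

def false_negatives(relevant, retrieved):
    """Relevant keys that the retrieval step failed to return."""
    return relevant - retrieved

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected key indices."""
    idx = sorted(selected)
    if not idx:
        return [0.0] * len(values[0])
    scores = [dot(query, keys[i]) for i in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, idx)) / total
        for d in range(len(values[0]))
    ]

# Toy data (hypothetical): scores against the query are 1, 0, 2, -1.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [0.0], [3.0], [0.0]]
query = [1.0, 0.0]
tau = 1.0

relevant = range_select(query, keys, tau)              # keys 0 and 2
budgeted = topk_select(query, keys, budget=1)          # key 2 only
missed = false_negatives(relevant, budgeted)           # key 0 is dropped
exact = sparse_attention(query, keys, values, relevant)
lossy = sparse_attention(query, keys, values, budgeted)
```

In this toy case the range query returns every key above the threshold by construction, while the budget of one key drops key 0 and shifts the attention output. How Louver defines the relevant range and answers the query efficiently is described in the paper itself, not in this sketch.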
