IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

TL;DR
IndexCache accelerates sparse attention in large language models by reusing top-k indices across layers, removing 75% of indexer computations with minimal quality loss.
Contribution
The paper introduces IndexCache, a method that exploits cross-layer redundancy in top-k selections to reuse attention indices across layers, cutting indexer cost without retraining.
Findings
Achieves up to 1.82× prefill speedup
Removes 75% of indexer computations
Incurs negligible quality degradation
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two…
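The Full/Shared partition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform "every fourth layer is Full" pattern, the toy scores, and the reading of "nearest Full layer" as the nearest preceding one are all assumptions made for illustration, and the sketch tracks a single query rather than a batch.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, sorted by position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def make_layer_pattern(num_layers: int, period: int) -> List[bool]:
    """Hypothetical uniform pattern: layer i is Full when i % period == 0."""
    return [i % period == 0 for i in range(num_layers)]


def run_layers(
    is_full: Sequence[bool],
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Walk the layers in order and return (indices per layer, indexer calls).

    Full layers run the indexer and refresh the cache. Shared layers reuse
    the cached indices of the nearest preceding Full layer (an assumption).
    """
    cached: Optional[List[int]] = None
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer, full in enumerate(is_full):
        # Fall back to running the indexer if no Full layer has been seen yet.
        if full or cached is None:
            cached = topk_indices(indexer(layer), k)
            indexer_calls += 1
        per_layer.append(cached)
    return per_layer, indexer_calls


def toy_indexer(layer: int) -> List[float]:
    """Stand-in for a per-layer indexer: fixed scores plus a small layer offset."""
    base = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
    return [s + 0.01 * layer for s in base]


pattern = make_layer_pattern(num_layers=8, period=4)
indices, calls = run_layers(pattern, toy_indexer, k=3)
saved_fraction = 1 - calls / len(pattern)
```

With 8 layers and a period of 4, only layers 0 and 4 run the indexer, so `saved_fraction` comes out to 0.75. That matches the 75% figure only because of the period chosen here; the summary does not state which layer pattern the paper uses.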
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
