IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai; Qian Dong; Ting Jiang; Xin Lv; Zhengxiao Du; Aohan Zeng; Jie Tang; Juanzi Li

arXiv:2603.12201 · cs.CL · March 13, 2026

Open Access

TL;DR

IndexCache accelerates sparse attention in large language models by reusing top-k token indices across layers. It removes 75% of indexer computations with negligible quality loss and achieves up to a 1.82× prefill speedup.

Contribution

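The paper introduces IndexCache, a method that exploits cross-layer redundancy in top-k selections: a small set of Full layers run their own indexers, and the remaining Shared layers reuse the nearest Full layer's indices. This cuts indexer computation without retraining.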

Findings

01. Achieves up to 1.82× prefill speedup
02. Removes 75% of indexer computations
03. Incurs negligible quality degradation

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^{2})$ to $O(Lk)$. However, the indexer itself retains $O(L^{2})$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two…
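The Full/Shared partition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the function names (`toy_indexer_scores`, `topk_indices`, `run_layers`), the plain dot-product scorer standing in for DSA's lightning indexer, the omission of causal masking, and the reading of "nearest Full layer" as the nearest preceding one. The hidden states are also held fixed across layers, so the sketch shows only the index bookkeeping, not a real forward pass.

```python
import numpy as np

def toy_indexer_scores(hidden, w_q, w_k):
    """Stand-in indexer (an assumption, not DSA's lightning indexer).

    Produces an L x L score matrix, which is the O(L^2) cost that a
    Shared layer avoids by skipping this call.
    """
    q = hidden @ w_q
    k = hidden @ w_k
    return q @ k.T

def topk_indices(scores, k):
    """Indices of the k highest-scoring keys for each query row."""
    # argpartition avoids a full sort; sort afterwards for a stable layout.
    part = np.argpartition(-scores, k - 1, axis=-1)[:, :k]
    return np.sort(part, axis=-1)

def run_layers(hidden, indexer_weights, full_layers, k):
    """Return per-layer top-k indices and the number of indexer calls.

    Full layers run the indexer; Shared layers reuse the indices cached
    by the nearest preceding Full layer (assumed interpretation).
    """
    if 0 not in full_layers:
        raise ValueError("layer 0 must be Full so later layers have indices to reuse")
    indices_per_layer = []
    cached = None
    indexer_calls = 0
    for layer, (w_q, w_k) in enumerate(indexer_weights):
        if layer in full_layers:
            cached = topk_indices(toy_indexer_scores(hidden, w_q, w_k), k)
            indexer_calls += 1
        # Shared layers fall through and reuse `cached` unchanged.
        indices_per_layer.append(cached)
    return indices_per_layer, indexer_calls

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, dim, k, num_layers = 16, 8, 4, 8
    hidden = rng.standard_normal((seq_len, dim))
    weights = [
        (rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim)))
        for _ in range(num_layers)
    ]
    indices, calls = run_layers(hidden, weights, full_layers={0, 4}, k=k)
    print(f"indexer calls: {calls} of {num_layers} layers")
```

In this toy configuration, two of eight layers are Full, so six of eight indexer calls (75%) are skipped. Which layers the paper actually designates as Full is not stated in the text above, so the choice of layers 0 and 4 is purely illustrative.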

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy