MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei

TL;DR
MISA applies a mixture-of-experts design to the indexer of DeepSeek Sparse Attention (DSA): a lightweight router activates only a few indexer heads per query, cutting the indexer's cost in long-context language model inference while keeping accuracy on par with DSA.
Contribution
MISA is a drop-in replacement for the DSA indexer that treats the indexer heads as a pool of experts and routes each query to a small subset of them. It selects tokens efficiently without retraining and achieves performance comparable to or better than DSA.
Findings
MISA matches DSA performance on LongBench while using fewer active indexer heads.
MISA reduces the indexer head count by 4-8x while maintaining accuracy.
A TileLang kernel speeds up indexer computation by 3.82x.
Abstract
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from…
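The routing idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the dense indexer scores a prefix token as a weighted sum over heads of ReLU(q_h · k_s), with indexer keys shared across heads; it assumes the "block-level statistics" are per-block mean keys; and it assumes the router ranks heads by their strongest coarse block response. The names `dense_indexer_scores`, `routed_indexer_scores`, `block_size`, and `num_active` are hypothetical.

```python
import numpy as np

def dense_indexer_scores(q, w, k):
    """Dense indexer for one query token (assumed form).

    q: (H, d) per-head indexer queries, w: (H,) per-head weights,
    k: (T, d) indexer keys shared by all heads. Returns (T,) token scores.
    """
    logits = np.maximum(q @ k.T, 0.0)   # ReLU(q_h . k_s), shape (H, T)
    return w @ logits                    # weighted sum over heads

def routed_indexer_scores(q, w, k, block_size, num_active):
    """Score tokens with only `num_active` heads chosen by a cheap router."""
    T, d = k.shape
    n_blocks = -(-T // block_size)       # ceil(T / block_size)
    pad = n_blocks * block_size - T
    k_pad = np.pad(k, ((0, pad), (0, 0)))
    counts = np.full(n_blocks, block_size, dtype=float)
    counts[-1] -= pad                    # last block may be partially filled
    # Assumed block-level statistic: mean key of each block.
    block_mean = k_pad.reshape(n_blocks, block_size, d).sum(axis=1) / counts[:, None]
    # Hypothetical router: rank heads by |w_h| times their best block response.
    coarse = np.maximum(q @ block_mean.T, 0.0)       # (H, n_blocks)
    head_score = np.abs(w) * coarse.max(axis=1)      # (H,)
    active = np.argsort(-head_score)[:num_active]
    # Only the active heads run the heavy token-level scoring.
    return dense_indexer_scores(q[active], w[active], k), active

def select_top_k(scores, top_k):
    """Indices of the top_k highest-scoring prefix tokens."""
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
H, d, T = 64, 16, 1000
q, w, k = rng.standard_normal((H, d)), rng.random(H), rng.standard_normal((T, d))
scores, active = routed_indexer_scores(q, w, k, block_size=64, num_active=8)
selected = select_top_k(scores, top_k=128)
```

Under these assumptions the token-level work scales with `num_active` rather than the full head count `H`, while the router only touches about `T / block_size` block summaries per head. Setting `num_active` equal to `H` recovers the dense scores exactly, which is a convenient sanity check.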
