MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Ruijie Zhou; Fanxu Meng; Yufei Xu; Tongxuan Liu; Guangming Lu; Muhan Zhang; Wenjie Pei

arXiv:2605.07363·cs.LG·May 11, 2026

TL;DR

MISA applies mixture-of-experts routing to the heads of the DSA indexer, so that only a few query-dependent heads perform token-level scoring. This reduces indexer cost in long-context language model inference while maintaining accuracy.

Contribution

MISA replaces the DSA indexer with a mixture-of-experts design that selects tokens efficiently without retraining and achieves comparable or better performance.

Findings

01. MISA matches DSA performance on LongBench with fewer heads.

02. MISA reduces indexer head count by 4-8 times while maintaining accuracy.

03. TileLang kernel speeds up indexer computation by 3.82x.

Abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from…
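
The two-stage idea in the abstract (a cheap router picks a few heads, then only those heads score individual tokens) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes the indexer score is a weighted sum of ReLU'd query-key dot products over heads, that the block-level statistic is the mean key of each block, and that the router ranks heads by their summed score against those block means. All function names (`block_means`, `route_heads`, `select_tokens`) and parameters (`block_size`, `num_active`, `top_k`) are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(x):
    return x if x > 0.0 else 0.0


def block_means(keys, block_size):
    """Mean key vector per block: the assumed cheap block-level statistic."""
    means = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        means.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return means


def route_heads(head_queries, head_weights, means, num_active):
    """Rank heads by their score against block means; keep the top `num_active`."""
    scored = []
    for h, (q, w) in enumerate(zip(head_queries, head_weights)):
        scored.append((sum(w * relu(dot(q, m)) for m in means), h))
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by lower head index
    return sorted(h for _, h in scored[:num_active])


def select_tokens(head_queries, head_weights, keys, active_heads, top_k):
    """Token-level scoring using only the active heads, then top-k selection."""
    scored = []
    for s, k in enumerate(keys):
        score = sum(
            head_weights[h] * relu(dot(head_queries[h], k)) for h in active_heads
        )
        scored.append((score, s))
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken by earlier position
    return sorted(s for _, s in scored[:top_k])


# Toy example: 4 prefix tokens, 3 indexer heads, 2-dimensional keys.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
head_queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
head_weights = [1.0, 0.5, 1.0]

means = block_means(keys, block_size=2)
active = route_heads(head_queries, head_weights, means, num_active=1)
tokens = select_tokens(head_queries, head_weights, keys, active, top_k=2)
# active == [0], tokens == [0, 1]
```

In this sketch the router touches one summary vector per block instead of every token, and the token-level loop runs over `num_active` heads rather than the full pool, which is where the saving would come from under these assumptions.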
