Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piękos, Róbert Csordás, Jürgen Schmidhuber

TL;DR
Mixture of Sparse Attention (MoSA) is a dynamic, content-based sparse attention mechanism inspired by Mixture of Experts (MoE) with expert-choice routing. It reduces the computational cost of attention and is reported to outperform dense attention on language modeling.
Contribution
MoSA selects tokens dynamically, based on content, for each attention head. This allows arbitrary sparse attention patterns and lets more heads fit within the same computational budget as a dense model.
Findings
MoSA reduces the complexity of each attention head from O(T^2) to O(k^2 + T), where T is the sequence length and k is the number of tokens selected per head.
MoSA outperforms dense baselines with up to 27% better perplexity.
MoSA models are faster, use less memory, and have smaller KV-cache sizes.
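To make the complexity finding concrete, the following is a rough unit-cost sketch, not the authors' accounting. It counts one unit per attention score and one unit per routed token, and it ignores constant factors such as the head dimension. The function names and the example values T = 1024 and k = 64 are assumptions chosen for illustration.

```python
def dense_head_cost(T: int) -> int:
    """Unit cost of one dense head: every token scores every token, O(T^2)."""
    return T * T


def mosa_head_cost(T: int, k: int) -> int:
    """Unit cost of one sparse head: k*k scores among the selected tokens,
    plus one routing score per token in the sequence, O(k^2 + T)."""
    return k * k + T


def heads_within_dense_budget(T: int, k: int) -> int:
    """How many sparse heads fit in the unit cost of a single dense head."""
    return dense_head_cost(T) // mosa_head_cost(T, k)


# Hypothetical sizes, chosen only for illustration.
T, k = 1024, 64
print(dense_head_cost(T))               # 1048576
print(mosa_head_cost(T, k))             # 5120
print(heads_within_dense_budget(T, k))  # 204
```

Under these simplified assumptions, a single dense head costs as much as roughly two hundred sparse heads, which is the sense in which sparsity frees budget for additional, more specialized heads.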
Abstract
Recent advances in large language models highlighted the excessive quadratic cost of self-attention. Despite the significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting k tokens from a sequence of length T, MoSA reduces the computational complexity of each attention head from O(T^2) to O(k^2 + T). This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one…
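The selection step described in the abstract can be sketched as a single attention head with expert-choice routing. This is a minimal NumPy illustration, not the authors' implementation. The abstract does not specify the details, so the sigmoid router, the causal mask on original positions, the gating of outputs by router score, and the zero output for unselected tokens are all assumptions of this sketch, as are the function and parameter names. A per-head output projection is omitted for brevity.

```python
import numpy as np


def mosa_head(X, W_r, W_q, W_k, W_v, k):
    """One sparse head: the head (the 'expert') chooses its own k tokens."""
    T = X.shape[0]
    # Router: one scalar score per token (sigmoid gate is an assumption).
    r = 1.0 / (1.0 + np.exp(-(X @ W_r)))            # shape (T,)
    # Expert choice: keep the k highest-scoring tokens, in original order.
    idx = np.sort(np.argsort(-r)[:k])               # shape (k,)
    Xs = X[idx]                                     # shape (k, d_model)
    Q, K, V = Xs @ W_q, Xs @ W_k, Xs @ W_v          # each (k, d_head)
    # Attention only among the selected tokens: k*k scores instead of T*T.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Causal mask based on original positions (assumption of this sketch).
    causal = idx[:, None] >= idx[None, :]
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    # Scale outputs by router score so the router receives gradient signal
    # in a trainable version (assumption of this sketch).
    Ys = (A @ V) * r[idx, None]
    # Scatter back to sequence length; unselected tokens get zeros.
    Y = np.zeros((T, V.shape[-1]))
    Y[idx] = Ys
    return Y, idx


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_head, k = 16, 8, 4, 4
X = rng.normal(size=(T, d_model))
Y, idx = mosa_head(
    X,
    rng.normal(size=d_model),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    k=k,
)
print(Y.shape)  # (16, 4)
print(idx)      # positions of the k tokens this head selected
```

Because each head selects its own tokens, different heads can cover different parts of the sequence, and only the selected tokens need keys and values, which is consistent with the smaller KV-cache reported in the findings.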
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need
