Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

Piotr Piękos, Róbert Csordás, Jürgen Schmidhuber

arXiv:2505.00315 · cs.LG · May 2, 2025

TL;DR

Mixture of Sparse Attention (MoSA) introduces a dynamic, content-based sparse attention mechanism inspired by MoE, significantly reducing computational costs and outperforming dense attention in language modeling tasks.

Contribution

MoSA applies MoE-style expert-choice routing to attention: each head selects its own subset of tokens based on content, which lowers the per-head cost and frees compute for more, and more specialized, heads.

Findings

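01. MoSA reduces the computational complexity of each attention head from O(T^2) to O(k^2 + T), where T is the sequence length and k is the number of tokens the head selects.

02. MoSA outperforms dense baselines, with up to 27% better perplexity at an identical compute budget.

03. When matched in perplexity to a dense baseline, MoSA models run faster in wall-clock time, need less memory for training, and have a much smaller KV cache.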
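
To give a feel for the first finding, the per-head cost terms can be compared numerically. This is only a rough sketch: it treats O(T^2) and O(k^2 + T) as exact operation counts with unit constants, ignores projections and head dimension, and uses hypothetical values of T and k that are not taken from the paper.

```python
def dense_head_cost(T: int) -> int:
    """Idealized cost of one dense attention head: every token attends to every token."""
    return T * T


def mosa_head_cost(T: int, k: int) -> int:
    """Idealized cost of one sparse head: O(T) routing plus O(k^2) attention."""
    return k * k + T


# Hypothetical sizes chosen for illustration only.
T, k = 1024, 64
dense = dense_head_cost(T)    # 1024 * 1024 = 1_048_576
sparse = mosa_head_cost(T, k)  # 64 * 64 + 1024 = 5_120

# Number of sparse heads that fit in the idealized budget of one dense head.
heads_per_dense_budget = dense // sparse  # 204
print(dense, sparse, heads_per_dense_budget)
```

Under these simplifying assumptions, roughly 200 sparse heads cost as much as one dense head. In practice the achievable ratio depends on constants that this count leaves out.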

Abstract

Recent advances in large language models have highlighted the excessive quadratic cost of self-attention. Despite significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^{2})$ to $O(k^{2} + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one…
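
The mechanism described in the abstract can be sketched for a single head as follows. This is a minimal NumPy illustration, not the authors' implementation (the official repository is in PyTorch). It assumes a sigmoid router that scores every token, top-k selection by the head, causal masking based on the original token positions, and scaling of the output by the router score. The function name `mosa_head` and all weight names are hypothetical.

```python
import numpy as np


def mosa_head(x, w_r, w_q, w_k, w_v, w_o, k):
    """One sparse attention head. x: (T, d_model). Returns (y, idx), y: (T, d_model)."""
    # Router: one scalar score per token, O(T) work (assumed sigmoid scoring).
    scores = 1.0 / (1.0 + np.exp(-(x @ w_r)))          # (T,)
    # Expert choice: the head picks its own top-k tokens; sort to keep sequence order.
    idx = np.sort(np.argsort(-scores)[:k])             # (k,)
    xs = x[idx]                                        # (k, d_model)
    q, key, v = xs @ w_q, xs @ w_k, xs @ w_v           # each (k, d_head)
    # Attention among the selected tokens only, O(k^2) work.
    logits = (q @ key.T) / np.sqrt(q.shape[-1])        # (k, k)
    causal = idx[:, None] >= idx[None, :]              # mask by original positions
    logits = np.where(causal, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ v) @ w_o                          # (k, d_model)
    # Scale by the router score (assumed; this is how the router would receive
    # gradients in an autodiff framework).
    out = out * scores[idx, None]
    # Scatter back: unselected tokens get a zero contribution from this head.
    y = np.zeros_like(x)
    y[idx] = out
    return y, idx


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_head, k = 16, 8, 4, 4
x = rng.normal(size=(T, d_model))
w_r = rng.normal(size=(d_model,))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))
w_o = rng.normal(size=(d_head, d_model))

y, idx = mosa_head(x, w_r, w_q, w_k, w_v, w_o, k)
print(y.shape, idx)
```

In a full layer, many such heads would run in parallel, each with its own router and its own selected tokens, and their outputs would be summed. Because only k tokens per head produce keys and values, the KV cache for a head holds k entries rather than T.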

Code & Models

Repositories

piotrpiekos/MoSA (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need