Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

Qishuai Wen; Zhiyuan Huang; Xianghan Meng; Wei He; Chun-Guang Li

arXiv:2602.01219·cs.LG·May 12, 2026

TL;DR

This paper introduces Mixture-of-Top-k Attention (MiTA), a scalable and flexible attention mechanism that improves efficiency and effectiveness in vision tasks by using deformable fast-weight experts and reusable top-k sets.

Contribution

The paper proposes MiTA, an attention method that combines deformable fast-weight experts with reusable top-k sets, making it more scalable and flexible than prior methods.

Findings

1. MiTA outperforms existing methods in vision tasks.
2. MiTA exhibits an emergent token-pruning effect.
3. MiTA generalizes well from standard attention mechanisms.

Abstract

The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-$k$ Attention (MiTA), which employs a…
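The fast-weight reading in the abstract's first sentence can be checked numerically. The sketch below is not taken from the paper or its repository: the names `attention`, `fast_weight_mlp`, `W1`, and `W2` are hypothetical, and it assumes a single head, a single query, and the usual $1/\sqrt{d}$ scaling. It shows that softmax attention over $N$ key/value pairs gives the same output as a two-layer MLP whose first-layer weights are the scaled keys, whose activation is softmax, and whose second-layer weights are the values, so the hidden width equals $N$.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, K, V):
    """Single-query attention: softmax(q K^T / sqrt(d)) V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]


def fast_weight_mlp(x, W1, W2):
    """Two-layer MLP with a softmax activation.

    W1 has one row per hidden unit (N rows of length d),
    W2 has one row per hidden unit (N rows of length d_v).
    """
    hidden = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    return [sum(h * row[j] for h, row in zip(hidden, W2)) for j in range(len(W2[0]))]


random.seed(0)
N, d = 6, 4  # sequence length and head dimension (illustrative values)
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
q = [random.gauss(0, 1) for _ in range(d)]

# The "fast weights" are built from the inputs themselves:
# layer 1 = scaled keys, layer 2 = values. Hidden width is N.
W1 = [[k_j / math.sqrt(d) for k_j in k] for k in K]
W2 = V

out_attn = attention(q, K, V)
out_mlp = fast_weight_mlp(q, W1, W2)
max_diff = max(abs(a - b) for a, b in zip(out_attn, out_mlp))
print("hidden width:", len(W1), "max abs difference:", max_diff)
```

Because `W1` and `W2` gain one row per token, the width of this MLP grows with the context, which is the scalability problem the abstract attributes to vanilla attention.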

Code & Models

Repositories

QishuaiWen/MiTA (GitHub)
