TL;DR
This paper introduces Mixture-of-Top-k Attention (MiTA), a scalable and flexible attention mechanism that improves efficiency and effectiveness in vision tasks by using deformable fast-weight experts and reusable top-k sets.
Contribution
It proposes MiTA, a novel attention method combining deformable fast-weight experts with reusable top-k sets, enhancing scalability and flexibility over prior methods.
Findings
MiTA outperforms existing methods in vision tasks.
MiTA exhibits an emergent token-pruning effect.
MiTA generalizes well from standard attention mechanisms.
Abstract
The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-k Attention (MiTA), which employs a…
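The fast-weight view in the abstract can be checked with a small sketch. The pure-Python code below is an illustration, not the paper's implementation. It treats the keys as the first-layer weights and the values as the second-layer weights of an MLP whose hidden width equals the sequence length, then contrasts that with a block-routed variant in the spirit of the MoE attention the abstract describes. All function names are hypothetical. The routing rule (score each block by the dot product of the query with the block's mean key) is an assumption chosen for simplicity; the abstract does not specify a routing rule, and this is not MiTA's top-k mechanism.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(q, K, V):
    # Single-query scaled dot-product attention over N key/value pairs.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

def fast_weight_mlp(q, K, V):
    # The same computation written as a two-layer MLP whose weights come
    # from the input sequence itself (the "fast weights").
    W1 = K                                # first layer: N x d
    W2 = [list(col) for col in zip(*V)]   # second layer: d_v x N
    hidden = softmax([h / math.sqrt(len(q)) for h in matvec(W1, q)])
    return matvec(W2, hidden), len(hidden)  # hidden width equals N

def block_routed_attention(q, K, V, block_size, top_blocks):
    # ASSUMPTION: blocks are scored by the query's dot product with the
    # block's mean key. This routing rule is illustrative only.
    starts = range(0, len(K), block_size)

    def block_score(s):
        block = K[s:s + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        return sum(a * b for a, b in zip(q, mean_key))

    chosen = sorted(starts, key=block_score, reverse=True)[:top_blocks]
    idx = [i for s in sorted(chosen)
           for i in range(s, min(s + block_size, len(K)))]
    # Attend only over tokens in the selected blocks ("experts").
    return attention(q, [K[i] for i in idx], [V[i] for i in idx])

random.seed(0)
N, d = 8, 4
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
q = [random.gauss(0, 1) for _ in range(d)]

full = attention(q, K, V)
mlp_out, width = fast_weight_mlp(q, K, V)
sparse = block_routed_attention(q, K, V, block_size=2, top_blocks=2)
```

Here `full` and `mlp_out` agree, and `width` equals `N`, which is why the cost of the dense form grows with sequence length. The routed variant reads only `block_size * top_blocks` tokens per query; selecting every block recovers full attention exactly.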
