SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging
Nhat Thanh Tran, Fanghui Xue, Shuai Zhang, Jiancheng Lyu, Yunling Zheng, Yingyong Qi, Jack Xin

TL;DR
This paper introduces SEMA, a novel attention mechanism for vision transformers that improves scalability and focus by combining token localization with averaging, outperforming existing linear and Mamba-like attention models on ImageNet-1k.
Contribution
The paper proposes SEMA, a new attention method that avoids dispersion and enhances focus, providing a scalable and effective alternative to linear attention in vision tasks.
Findings
SEMA outperforms recent vision Mamba models on ImageNet-1k.
SEMA maintains focus while avoiding dispersion in attention.
SEMA scales effectively to larger image sizes.
Abstract
Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA), which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach…
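The dispersion claim in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes unit-norm random queries and keys and the standard scaled dot-product softmax, and the names `softmax_attention_weights`, `max_weight`, and the `window` parameter are hypothetical. With scores bounded in [-B, B], every softmax weight is at most e^{2B}/n, so the largest weight shrinks toward the uniform value 1/n as the number of keys n grows. Restricting the query to a fixed-size window of keys, a crude stand-in for token localization, keeps the largest weight at or above 1/window regardless of n.

```python
import numpy as np


def softmax_attention_weights(query, keys):
    """Softmax attention weights of one query over a set of keys."""
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def max_weight(n_keys, dim=16, window=None, seed=0):
    """Largest attention weight for a random unit-norm query and keys.

    If `window` is given, the query attends only to the first `window`
    keys (an illustrative stand-in for localization, not SEMA itself).
    """
    rng = np.random.default_rng(seed)
    query = rng.standard_normal(dim)
    query /= np.linalg.norm(query)
    keys = rng.standard_normal((n_keys, dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    if window is not None:
        keys = keys[:window]
    return softmax_attention_weights(query, keys).max()


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        full = max_weight(n)
        local = max_weight(n, window=64)
        print(f"n={n:6d}  full max weight={full:.6f}  windowed max weight={local:.6f}")
```

In this setup the full-attention maximum decays roughly like 1/n, while the windowed maximum stays the same for every n because the window size is fixed. How SEMA actually localizes tokens and combines the result with arithmetic averaging is defined in the paper, not in this sketch.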
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Focus · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
