SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

Nhat Thanh Tran; Fanghui Xue; Shuai Zhang; Jiancheng Lyu; Yunling Zheng; Yingyong Qi; Jack Xin

arXiv:2506.08297·cs.CV·June 11, 2025

TL;DR

This paper introduces SEMA, a novel attention mechanism for vision transformers that improves scalability and focus by combining token localization with averaging, outperforming existing linear and Mamba-like attention models on ImageNet-1k.

Contribution

The paper proposes SEMA, a new attention method that avoids dispersion and enhances focus, providing a scalable and effective alternative to linear attention in vision tasks.

Findings

01. SEMA outperforms recent vision Mamba models on ImageNet-1k.

02. SEMA maintains focus while avoiding dispersion in attention.

03. SEMA scales effectively to larger image sizes.
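
To make "token localization with averaging" concrete, here is a toy sketch on a 1D sequence of scalar tokens. It is not the SEMA formulation from the paper. The function `local_plus_mean_attention`, the fixed `window` size, and the choice to add the local output to the global mean are all assumptions made for illustration. The point it shows is structural: each query attends to at most `window` keys, so the cost grows linearly with the number of tokens, while a plain arithmetic mean supplies global context.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_plus_mean_attention(queries, keys, values, window=3):
    """Toy attention: softmax over a local window plus a global mean.

    Hypothetical illustration only; the way the two terms are combined
    (a plain sum) is an assumption, not the method from the paper.
    """
    n = len(values)
    global_mean = sum(values) / n  # arithmetic average over all tokens
    half = window // 2
    outputs = []
    for i in range(n):
        # Token localization: query i only sees keys in [lo, hi).
        lo, hi = max(0, i - half), min(n, i + half + 1)
        weights = softmax([queries[i] * keys[j] for j in range(lo, hi)])
        local = sum(w * values[j] for w, j in zip(weights, range(lo, hi)))
        outputs.append(local + global_mean)
    return outputs

# Usage: 8 scalar tokens, window of 3 neighbours per query.
tokens = [0.1 * i for i in range(8)]
print(local_plus_mean_attention(tokens, tokens, tokens, window=3))
```

Because every window holds at most `window` keys, the largest local weight is at least `1 / window` regardless of sequence length, which is the sense in which a local window keeps attention from spreading thin in this toy setting.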

Abstract

Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA), which uses token localization to avoid dispersion and maintain focus, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach…
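
The dispersion statement can be illustrated numerically for the softmax case. The sketch below is not taken from the paper. It assumes attention scores are bounded in [-B, B] (drawn uniformly at random here, a hypothetical choice), in which case each softmax weight is at most e^{2B}/n for n keys, so the largest weight shrinks toward zero as n grows. The names `max_attention_weight` and `dispersion_upper_bound` are illustrative.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(num_keys, bound=2.0, seed=0):
    """Largest softmax weight for scores drawn uniformly from [-bound, bound].

    The uniform score distribution is an assumption for illustration.
    """
    rng = random.Random(seed)
    scores = [rng.uniform(-bound, bound) for _ in range(num_keys)]
    return max(softmax(scores))

def dispersion_upper_bound(num_keys, bound=2.0):
    """Upper bound e^{2B}/n on any softmax weight when scores lie in [-B, B]."""
    return math.exp(2 * bound) / num_keys

# Print the largest weight next to its bound for growing key counts.
for n in (16, 256, 4096, 65536):
    print(n, max_attention_weight(n), dispersion_upper_bound(n))
```

The bound follows because each numerator is at most e^{B} and the denominator is at least n e^{-B}. Once n is large, no single key can receive more than a vanishing share of the weight, which is the loss of focus the abstract refers to.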

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Focus · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces