A2Mamba: Attention-augmented State Space Models for Visual Recognition
Meng Lou, Yunxiang Fu, Yizhou Yu

TL;DR
A2Mamba is a Transformer-Mamba hybrid architecture that integrates multi-scale attention maps into its state space model, improving visual recognition through stronger spatial dependency modeling.
Contribution
The paper proposes A2Mamba, a Transformer-Mamba hybrid whose token mixer, the Multi-scale Attention-augmented State Space Model (MASS), integrates attention with the SSM instead of simply stacking Transformer and Mamba layers.
Findings
Achieves 86.1% top-1 accuracy on ImageNet-1K.
Outperforms previous architectures in semantic segmentation and object detection.
Uses 40% fewer parameters while maintaining higher accuracy.
Abstract
Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM's hidden states using the…
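The abstract is cut off before it finishes describing A2SSM, so the following is only a minimal sketch of the general idea it names: spatially aggregating SSM hidden states with an attention map, in the manner of cross-attention. Everything here is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, the attention map is a single-scale scaled dot-product softmax, and `linear_scan` is a toy fixed-decay recurrence standing in for Mamba's selective scan.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-stochastic N x N map from scaled dot-product scores (assumed form)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(qi, kj)) for kj in keys])
        for qi in queries
    ]


def linear_scan(tokens, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t, a stand-in for an SSM scan."""
    hidden = []
    state = [0.0] * len(tokens[0])
    for x in tokens:
        state = [decay * s + v for s, v in zip(state, x)]
        hidden.append(state)
    return hidden


def aggregate_hidden_states(attn, hidden):
    """Each output token is an attention-weighted sum over all hidden states."""
    dim = len(hidden[0])
    return [
        [sum(w * h[d] for w, h in zip(row, hidden)) for d in range(dim)]
        for row in attn
    ]


# Three flattened image tokens with two channels each (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = attention_map(tokens, tokens)
hidden = linear_scan(tokens)
mixed = aggregate_hidden_states(attn, hidden)
```

In this sketch the scan by itself only lets a token see earlier tokens in the flattened order, while the aggregation step lets every output position draw on hidden states from all positions, which is one way to read "spatially aggregating" in the abstract.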
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
