Mamba-R: Vision Mamba ALSO Needs Registers
Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

TL;DR
Mamba-R adds register tokens to Vision Mamba to suppress high-norm artifacts in background regions, which yields cleaner feature maps, higher accuracy, and better scaling to large models, as shown on ImageNet classification and segmentation tasks.
Contribution
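This paper proposes Mamba-R, which adds register tokens to Vision Mamba to mitigate feature-map artifacts and improve scalability and accuracy. Two modifications adapt registers to Mamba's uni-directional processing: registers are inserted evenly throughout the input token sequence, and they are recycled for the final prediction.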
Findings
Base-size Mamba-R achieves 82.9% accuracy on ImageNet.
Scaling Mamba-R to 341M parameters yields 83.2% accuracy (84.5% when fine-tuned with 384×384 inputs).
Qualitative results show cleaner, more focused feature maps.
Abstract
Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba-R. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps appear cleaner and more focused on semantically meaningful…
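The two modifications named in the abstract (evenly inserted registers, and registers recycled for the final prediction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `insert_registers` and `recycle_registers` are hypothetical, placing each register at the end of its patch segment is an assumption, and the Mamba blocks that would run between the two steps are omitted.

```python
import numpy as np


def insert_registers(patches, registers):
    """Interleave register tokens evenly into a patch-token sequence.

    patches:   array of shape (num_patches, dim)
    registers: array of shape (num_registers, dim)
    Returns the combined sequence and the index of each register in it.
    Assumption: one register is appended after each equal-length patch segment.
    """
    num_patches, dim = patches.shape
    num_registers = registers.shape[0]
    if num_patches % num_registers != 0:
        raise ValueError("this sketch assumes num_patches is divisible by num_registers")
    segment = num_patches // num_registers

    # Split the patches into equal segments, then append one register to each.
    chunks = patches.reshape(num_registers, segment, dim)
    tokens = np.concatenate([chunks, registers[:, None, :]], axis=1)
    tokens = tokens.reshape(num_registers * (segment + 1), dim)

    # Each register sits at the last slot of its (segment + 1)-long group.
    register_positions = np.arange(num_registers) * (segment + 1) + segment
    return tokens, register_positions


def recycle_registers(tokens, register_positions):
    """Gather the register tokens and concatenate them into one feature vector.

    In a full model, `tokens` would be the output of the Mamba blocks, and the
    resulting vector would feed the classification head.
    """
    return tokens[register_positions].reshape(-1)


# Toy example: 8 patch tokens of width 4, with 2 registers.
patches = np.arange(32, dtype=np.float32).reshape(8, 4)
registers = -np.ones((2, 4), dtype=np.float32)

tokens, positions = insert_registers(patches, registers)
print(tokens.shape)   # (10, 4)
print(positions)      # [4 9]

feature = recycle_registers(tokens, positions)
print(feature.shape)  # (8,)
```

Spreading the registers through the sequence, rather than grouping them at one end, means a uni-directional scan meets a register at regular intervals; gathering them afterwards reuses what they accumulated instead of discarding it.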
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
