MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Ali Hatamizadeh, Jan Kautz

TL;DR
MambaVision is a hybrid Mamba-Transformer backbone that combines the strengths of the Mamba architecture and Vision Transformers, achieving state-of-the-art results across multiple vision tasks.
Contribution
The paper presents a novel hybrid Mamba-Transformer architecture, integrating self-attention into Mamba to enhance long-range dependency modeling in vision applications.
Findings
- Achieves SOTA Top-1 accuracy and throughput on ImageNet-1K classification
- Outperforms comparable backbones in object detection and segmentation
- Demonstrates efficient modeling of visual features with a hybrid architecture
Abstract
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance…
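The abstract's central design finding, self-attention placed in the final layers of an otherwise Mamba-based stack, can be pictured with a small layout helper. This is only a sketch: the function name `stage_layout`, the block labels, and the example split of 8 layers into 4 + 4 are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List


def stage_layout(depth: int, num_attention: int) -> List[str]:
    """Return block types for one stage: Mamba-style mixers first, self-attention last.

    Hypothetical helper for illustration; the labels are placeholders, not real modules.
    """
    if depth < 0 or not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    # Self-attention blocks occupy the final `num_attention` positions of the stage.
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


if __name__ == "__main__":
    # Assumed example: an 8-layer stage whose last 4 layers use self-attention.
    print(stage_layout(8, 4))
```

Setting `num_attention` to 0 gives a pure Mamba stage, which makes it easy to compare layouts of the kind an ablation over attention placement would consider.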
Code & Models
- 🤗 nvidia/MambaVision-T-1K · model · 4.6k dl · ♡ 41
- 🤗 nvidia/MambaVision-T2-1K · model · 85 dl · ♡ 5
- 🤗 nvidia/MambaVision-S-1K · model · 911 dl · ♡ 9
- 🤗 nvidia/MambaVision-L-1K · model · 87 dl · ♡ 5
- 🤗 nvidia/MambaVision-L2-1K · model · 21 dl · ♡ 13
- 🤗 nvidia/MambaVision-B-21K · model · 187 dl · ♡ 6
- 🤗 nvidia/MambaVision-L-21K · model · 91 dl · ♡ 4
- 🤗 nvidia/MambaVision-L2-512-21K · model · 21 dl · ♡ 3
- 🤗 nvidia/MambaVision-L3-512-21K · model · 99 dl · ♡ 54
- 🤗 nvidia/MambaVision-L3-256-21K · model · 49 dl · ♡ 7
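A checkpoint from the list above could be loaded roughly as follows. This is a hedged sketch, not the authors' documented procedure: it assumes the `transformers` library is installed, that the checkpoints ship custom modeling code (hence `trust_remote_code=True`), and that any extra runtime dependencies named on the model card are already available. The helpers `hub_url` and `load_classifier` are hypothetical names introduced here.

```python
# Repository IDs copied from the model list on this page.
MODEL_IDS = [
    "nvidia/MambaVision-T-1K",
    "nvidia/MambaVision-T2-1K",
    "nvidia/MambaVision-S-1K",
    "nvidia/MambaVision-L-1K",
    "nvidia/MambaVision-L2-1K",
    "nvidia/MambaVision-B-21K",
    "nvidia/MambaVision-L-21K",
    "nvidia/MambaVision-L2-512-21K",
    "nvidia/MambaVision-L3-512-21K",
    "nvidia/MambaVision-L3-256-21K",
]


def hub_url(repo_id: str) -> str:
    """Build the Hugging Face Hub page URL for a repository ID."""
    return f"https://huggingface.co/{repo_id}"


def load_classifier(repo_id: str):
    """Load an image-classification checkpoint (hypothetical helper).

    Assumption: the repository provides custom code, so remote code must be trusted.
    Review that code before enabling this in a sensitive environment.
    """
    if repo_id not in MODEL_IDS:
        raise ValueError(f"unknown repository: {repo_id}")
    # Imported lazily so the rest of the module works without transformers installed.
    from transformers import AutoModelForImageClassification

    return AutoModelForImageClassification.from_pretrained(
        repo_id, trust_remote_code=True
    )


if __name__ == "__main__":
    for repo_id in MODEL_IDS:
        print(hub_url(repo_id))
```

Calling `load_classifier("nvidia/MambaVision-T-1K")` downloads weights on first use, so it is kept out of the `__main__` block, which only prints the Hub URLs.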
Taxonomy
Topics: Advanced Vision and Imaging
