MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh; Jan Kautz

arXiv:2407.08083 · cs.CV · March 26, 2025 · 27 citations



TL;DR

MambaVision is a hybrid Mamba-Transformer backbone that combines the Mamba architecture with Vision Transformer self-attention. It reports state-of-the-art Top-1 accuracy and throughput on ImageNet-1K and outperforms comparable backbones on object detection and segmentation.

Contribution

The paper presents a hybrid Mamba-Transformer architecture that redesigns the Mamba formulation for visual features and adds self-attention blocks in the final layers to better capture long-range spatial dependencies.

Findings

01. Achieves SOTA Top-1 accuracy on ImageNet-1K

02. Outperforms comparable backbones in object detection and segmentation

03. Demonstrates efficient modeling of visual features with a hybrid architecture

Abstract

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance…
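The layer arrangement described in the abstract, with Mamba blocks followed by self-attention blocks in the final layers, can be illustrated with a small sketch. This is not the authors' implementation: the function name `hybrid_layer_pattern`, the labels `"mamba"` and `"attention"`, and the example split of 8 layers into 4 and 4 are all assumptions chosen for illustration, since the abstract as quoted does not state how many layers use self-attention.

```python
from typing import List


def hybrid_layer_pattern(depth: int, num_attention: int) -> List[str]:
    """Return a hypothetical block layout for one stage.

    The first (depth - num_attention) layers are labeled "mamba" and the
    final num_attention layers are labeled "attention", mirroring the idea
    of placing self-attention blocks at the end of a Mamba stack.
    """
    if depth < 0 or not 0 <= num_attention <= depth:
        raise ValueError("need 0 <= num_attention <= depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Assumed example: an 8-layer stage whose last 4 layers use self-attention.
pattern = hybrid_layer_pattern(depth=8, num_attention=4)
print(pattern)
```

Setting `num_attention=0` gives a pure Mamba stack, which is the kind of baseline an ablation over the attention placement would compare against.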


Taxonomy

Topics: Advanced Vision and Imaging