VMamba: Visual State Space Model
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, Yunfan Liu

TL;DR
VMamba introduces a novel vision backbone using Visual State-Space blocks with linear time complexity, effectively capturing contextual information in 2D vision data, and demonstrates superior efficiency and performance across various visual tasks.
Contribution
The paper adapts the Mamba state-space model into a vision-specific architecture built around a new SS2D module, enabling efficient processing of 2D visual data.
Findings
Superior input scaling efficiency compared to benchmarks
Effective collection of contextual information from multiple perspectives
Promising performance across diverse visual perception tasks
Abstract
Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input…
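The four-route traversal described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the four routes are row-major, column-major, and the reverse of each (the text above does not specify them), and it substitutes a fixed-decay linear recurrence, `toy_scan`, for Mamba's input-dependent selective scan. The names `cross_scan`, `cross_merge`, `toy_scan`, and `ss2d_sketch` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cross_scan(x):
    """Unfold an (H, W, C) feature map into four (H*W, C) sequences.

    Assumed routes: row-major, column-major, and the reverse of each.
    """
    h, w, c = x.shape
    row_major = x.reshape(h * w, c)                      # left-to-right, then down
    col_major = x.transpose(1, 0, 2).reshape(h * w, c)   # top-to-bottom, then right
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(seqs, h, w):
    """Undo each traversal and sum the four results into one (H, W, C) map."""
    c = seqs.shape[-1]
    row_fwd = seqs[0].reshape(h, w, c)
    col_fwd = seqs[1].reshape(w, h, c).transpose(1, 0, 2)
    row_bwd = seqs[2][::-1].reshape(h, w, c)
    col_bwd = seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2)
    return row_fwd + col_fwd + row_bwd + col_bwd


def toy_scan(seq, decay=0.9):
    """Placeholder 1D scan: state_t = decay * state_{t-1} + x_t.

    A single pass over the sequence, so the cost grows linearly with its
    length. The real selective scan uses input-dependent parameters.
    """
    seq = np.asarray(seq, dtype=np.float64)
    out = np.empty_like(seq)
    state = np.zeros(seq.shape[-1], dtype=np.float64)
    for t in range(seq.shape[0]):
        state = decay * state + seq[t]
        out[t] = state
    return out


def ss2d_sketch(x):
    """Scan a 2D feature map along four routes and merge the outputs."""
    h, w, _ = x.shape
    scanned = np.stack([toy_scan(s) for s in cross_scan(x)])
    return cross_merge(scanned, h, w)


if __name__ == "__main__":
    feature_map = np.random.rand(4, 5, 8)
    print(ss2d_sketch(feature_map).shape)
```

Because each route visits every position once, each output position in this sketch aggregates information from positions that precede it along four different orderings, which is one way to read "various sources and perspectives" in the abstract.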
Code & Models
- 🤗 saurabhati/DASS_small_AudioSet_47.2 · model · 2 dl · ♡ 1
- 🤗 saurabhati/DASS_medium_AudioSet_47.6 · model · 2 dl
- 🤗 saurabhati/VMamba_ImageNet_82.6 · model · 152 dl · ♡ 3
- 🤗 saurabhati/VMamba_ImageNet_83.6 · model · 83 dl
- 🤗 saurabhati/DASS_small_AudioSet_48.6 · model · 10 dl
- 🤗 saurabhati/DASS_medium_AudioSet_48.9 · model
- 🤗 saurabhati/DASS_small_AudioSet_50.1 · model · 45 dl
- 🤗 saurabhati/DASS_medium_AudioSet_50.2 · model · 53 dl · ♡ 2
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Vision and Imaging
