VMamba: Visual State Space Model

Yue Liu; Yunjie Tian; Yuzhong Zhao; Hongtian Yu; Lingxi Xie; Yaowei Wang; Qixiang Ye; Jianbin Jiao; Yunfan Liu

arXiv:2401.10166 · cs.CV · December 31, 2024 · 360 cites

Open Access · 5 Repos · 8 Models · 1 Video

TL;DR

VMamba introduces a novel vision backbone using Visual State-Space blocks with linear time complexity, effectively capturing contextual information in 2D vision data, and demonstrates superior efficiency and performance across various visual tasks.

Contribution

The paper adapts the Mamba state-space model into a vision-specific architecture built around a new 2D Selective Scan (SS2D) module, which enables efficient processing of 2D visual data.

Findings

01. Superior input scaling efficiency compared to benchmarks

02. Effective collection of contextual information from multiple perspectives

03. Promising performance across diverse visual perception tasks

Abstract

Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input…
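The four scanning routes mentioned in the abstract can be illustrated with a minimal sketch. It assumes the routes are row-major order, column-major order, and the reversal of each; the function names `cross_scan` and `cross_merge` are illustrative and are not taken from the page. The 1D selective scan that the model would run on each sequence is omitted, so the merge step below only sums the unmodified sequences back onto the grid.

```python
def cross_scan(grid):
    """Unfold a non-empty 2D grid (list of equal-length rows) into four 1D sequences.

    Assumed routes: row-major, column-major, and the reverse of each.
    """
    height, width = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(height) for j in range(width)]
    col_major = [grid[i][j] for j in range(width) for i in range(height)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(seqs, height, width):
    """Map the four sequences back to grid positions and sum them per position."""
    row_major, col_major, row_rev, col_rev = seqs
    # Undo the reversals so indices line up with the forward routes.
    row_back = row_rev[::-1]
    col_back = col_rev[::-1]
    merged = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            r = i * width + j    # index of (i, j) in row-major order
            c = j * height + i   # index of (i, j) in column-major order
            merged[i][j] = row_major[r] + row_back[r] + col_major[c] + col_back[c]
    return merged


grid = [[1, 2, 3],
        [4, 5, 6]]
routes = cross_scan(grid)
merged = cross_merge(routes, height=2, width=3)
```

Because no per-sequence processing is applied here, each merged entry is simply four times the input value. In a real block, each route would pass through its own sequence model before merging, so every position would receive context gathered from four different traversal directions.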

Code & Models

Videos

VMamba: Visual State Space Model · slideslive

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Vision and Imaging