Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion
Chaodong Xiao, Minghan Li, Zhengqiang Zhang, Deyu Meng, Lei Zhang

TL;DR
Spatial-Mamba adds a structure-aware state fusion mechanism to visual state space models. It captures spatial dependencies in images with a single scan and reports state-of-the-art results on image classification, detection, and segmentation.
Contribution
The paper proposes a structure-aware state fusion approach that models spatial dependencies directly in the state space of visual SSMs, and it places Mamba and linear attention under a common matrix multiplication framework.
Findings
Achieves state-of-the-art results in image classification, detection, and segmentation.
Unifies Mamba and linear attention under a matrix multiplication framework.
Operates effectively with a single scan, reducing computational costs.
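The second finding above rests on a general fact: a linear recurrence over a sequence can be written as one multiplication by a lower-triangular matrix. The sketch below checks this for a scalar-state recurrence h_t = a_t h_{t-1} + b_t x_t, y_t = c_t h_t. The function names and the scalar-state simplification are illustrative assumptions, not the paper's formulation. When every a_t equals 1, the matrix reduces to the causal mask applied to the outer product of c and b, which is the form of causal linear attention.

```python
import numpy as np

def ssm_recurrent(a, b, c, x):
    """Run h_t = a_t*h_{t-1} + b_t*x_t and y_t = c_t*h_t step by step."""
    h = 0.0
    y = np.empty(len(x), dtype=float)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y

def ssm_matrix(a, b, c):
    """Build the lower-triangular M such that y = M @ x for the same recurrence.

    M[i, j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i, and 0 otherwise.
    """
    n = len(a)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            decay = np.prod(a[j + 1 : i + 1])  # empty product is 1.0 when j == i
            M[i, j] = c[i] * decay * b[j]
    return M

rng = np.random.default_rng(0)
n = 6
a = rng.uniform(0.5, 1.0, n)   # per-step decay (input-dependent in a selective SSM)
b = rng.normal(size=n)         # input projection
c = rng.normal(size=n)         # output projection
x = rng.normal(size=n)

y_loop = ssm_recurrent(a, b, c, x)
y_mat = ssm_matrix(a, b, c) @ x
print(np.allclose(y_loop, y_mat))
```

Both paths compute the same outputs; the matrix form makes the comparison with attention-style operators explicit, while the loop form is what a scan would evaluate.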
Abstract
Selective state space models (SSMs), such as Mamba, excel at capturing long-range dependencies in 1D sequential data, but applying them to 2D vision tasks remains challenging. Current visual SSMs often convert images into 1D sequences and employ various scanning patterns to incorporate local spatial dependencies. However, these methods struggle to capture complex image spatial structures, and their lengthened scanning paths increase computational cost. To address these limitations, we propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. Instead of relying solely on sequential state transitions, we introduce a structure-aware state fusion equation, which leverages dilated convolutions to capture image spatial structural dependencies, significantly enhancing the flow of visual…
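As a rough illustration of fusing states over dilated neighborhoods, the sketch below treats the per-position states as a 2D map and replaces each state with a weighted sum of its 3x3 dilated neighbors. Everything specific here is an assumption made for illustration: the function names `dilated_fuse` and `multi_dilation_fuse`, the 3x3 kernel size, the zero padding, the dilation values, and the single-channel state. The abstract only says that dilated convolutions are used, so this should not be read as the paper's actual equation.

```python
import numpy as np

def dilated_fuse(state, weights, dilation=1):
    """Weighted sum of each state's 3x3 neighbours spaced `dilation` apart.

    state:   (H, W) float array of per-position states (single channel).
    weights: (3, 3) array of fusion weights (hypothetical, e.g. learned).
    """
    H, W = state.shape
    pad = dilation
    padded = np.pad(state, pad)  # zero padding on all sides
    out = np.zeros((H, W), dtype=float)
    for i in range(3):
        for j in range(3):
            di = (i - 1) * dilation  # row offset: -d, 0, +d
            dj = (j - 1) * dilation  # column offset: -d, 0, +d
            out += weights[i, j] * padded[pad + di : pad + di + H,
                                          pad + dj : pad + dj + W]
    return out

def multi_dilation_fuse(state, weights_per_dilation):
    """Sum the fused maps over several dilations (an assumed combination rule)."""
    return sum(dilated_fuse(state, w, d) for d, w in weights_per_dilation.items())

state = np.arange(25, dtype=float).reshape(5, 5)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0          # keeps only the centre state
uniform = np.ones((3, 3))     # equal weight on all nine neighbours

fused = multi_dilation_fuse(state, {1: uniform, 2: uniform})
print(fused.shape)
```

Because the fusion acts on the state map rather than on the scan order, each position can draw on neighbors in every direction even when the states were produced by one sequential pass.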
Taxonomy
Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods
Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
