LocalMamba: Visual State Space Model with Windowed Selective Scan

Tao Huang; Xiaohuan Pei; Shan You; Fei Wang; Chen Qian; Chang Xu

arXiv:2403.09338 · cs.CV · March 15, 2024 · 31 cites

TL;DR

LocalMamba introduces a windowed local scanning strategy, combined with a search over scan patterns for each layer, to improve sequence modeling in vision state space models. It outperforms Vim-Ti by 3.1% on ImageNet.

Contribution

The paper proposes a local scanning method that divides the image into windows, together with a search that selects scan patterns for each layer, so that vision state space models capture local dependencies more effectively.

Findings

01. Outperforms Vim-Ti by 3.1% on ImageNet

02. Effectively captures local dependencies in images

03. Significantly improves vision sequence modeling performance

Abstract

Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose…
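
The effect of the windowed scan described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`row_major_scan`, `windowed_scan`, `sequence_gap`), the 4x4 grid, and the 2x2 window are assumptions chosen for illustration. The sketch only compares how far apart two vertically adjacent tokens end up in the 1D sequence under plain row-by-row flattening versus a window-by-window ordering.

```python
def row_major_scan(height, width):
    """Token coordinates in plain flattening order: row by row."""
    return [(r, c) for r in range(height) for c in range(width)]


def windowed_scan(height, width, window):
    """Visit every token of one window x window block before the next block.

    Illustrative assumption: the grid divides evenly into square windows.
    """
    if height % window or width % window:
        raise ValueError("grid must divide evenly into windows in this sketch")
    order = []
    for top in range(0, height, window):          # window rows
        for left in range(0, width, window):      # window columns
            for r in range(top, top + window):    # rows inside the window
                for c in range(left, left + window):
                    order.append((r, c))
    return order


def sequence_gap(order, a, b):
    """Distance between two tokens once the grid is flattened to a sequence."""
    return abs(order.index(a) - order.index(b))


if __name__ == "__main__":
    flat = row_major_scan(4, 4)
    local = windowed_scan(4, 4, window=2)
    # Tokens (0, 0) and (1, 0) are vertical neighbours in the image.
    print("row-major gap:", sequence_gap(flat, (0, 0), (1, 0)))
    print("windowed gap:", sequence_gap(local, (0, 0), (1, 0)))
```

On this toy grid the two vertical neighbours sit 4 positions apart under row-major flattening and 2 positions apart under the windowed ordering. With row-major flattening the gap equals the grid width, so it grows with the image, while inside a window it is bounded by the window width.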

Taxonomy

Topics: Data Visualization and Analytics