LocalMamba: Visual State Space Model with Windowed Selective Scan
Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu

TL;DR
LocalMamba introduces a windowed local scanning strategy and a dynamic, layer-wise scan pattern search for vision state space models, outperforming Vim-Ti by 3.1% on ImageNet.
Contribution
The paper proposes a local scanning method that divides the image into windows, combined with a dynamic search that selects scan patterns for each layer, so that vision state space models capture local dependencies better.
Findings
Outperforms Vim-Ti by 3.1% on ImageNet
Windowed scanning captures local 2D dependencies that flattening spatial tokens overlooks
Scan pattern preferences vary across network layers, which motivates the per-layer search
Abstract
Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose…
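The effect of windowed scanning on token distance can be illustrated with a small sketch. This is not the authors' implementation: the function names `row_major_scan_order`, `windowed_scan_order`, and `sequence_distance` are hypothetical, the windows are assumed to be square and to divide the grid evenly, and the layer-wise scan pattern search is not shown. It only compares the order in which tokens are visited when a 2D grid is flattened row by row versus window by window.

```python
def row_major_scan_order(height, width):
    """Visit tokens row by row, as in plain flattening of spatial tokens."""
    return [r * width + c for r in range(height) for c in range(width)]


def windowed_scan_order(height, width, window):
    """Visit tokens window by window; inside each window, row by row.

    Tokens are identified by their row-major index (r * width + c).
    Assumes square windows that tile the grid exactly (an illustrative choice).
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for win_row in range(0, height, window):        # top edge of each window
        for win_col in range(0, width, window):     # left edge of each window
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)
    return order


def sequence_distance(order, token_a, token_b):
    """Number of scan steps separating two tokens in a given scan order."""
    position = {token: i for i, token in enumerate(order)}
    return abs(position[token_a] - position[token_b])


if __name__ == "__main__":
    flat = row_major_scan_order(4, 4)
    local = windowed_scan_order(4, 4, window=2)
    # Tokens 0 and 4 are vertical neighbours in the 4x4 grid.
    print(sequence_distance(flat, 0, 4))
    print(sequence_distance(local, 0, 4))
```

On a 4x4 grid with 2x2 windows, vertically adjacent tokens inside the same window end up closer in the windowed sequence than in the row-major one, which is the locality argument the abstract makes. Neighbours that straddle a window boundary are not helped by this ordering alone.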
Taxonomy
Topics: Data Visualization and Analytics
