SF-Mamba: Rethinking State Space Model for Vision
Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi

TL;DR
SF-Mamba is a vision model that makes the Mamba architecture more efficient and effective by enabling bidirectional information flow and better GPU parallelism. It outperforms existing methods across multiple vision tasks.
Contribution
The paper proposes SF-Mamba, a novel vision model that improves upon Mamba by incorporating auxiliary patch swapping and batch folding, addressing limitations in non-causal interactions and computational speed.
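The batch-folding idea can be illustrated with a toy recurrence. The sketch below is not the paper's kernel: it assumes a scalar linear recurrence in place of Mamba's selective scan, and the names `scan`, `fold`, `unfold`, and `reset_every` are hypothetical. It only shows that concatenating several short sequences into one longer sequence, and zeroing the hidden state at each original boundary, reproduces the per-sequence results.

```python
def scan(seq, decay=0.5, reset_every=None):
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t (stand-in for an SSM scan).

    If reset_every is set, the state is zeroed at every multiple of that
    length, so folded sequences do not leak into each other.
    """
    h = 0.0
    out = []
    for t, x in enumerate(seq):
        if reset_every is not None and t % reset_every == 0:
            h = 0.0  # state reset at an original sequence boundary
        h = decay * h + x
        out.append(h)
    return out


def fold(batch, factor):
    """Concatenate every `factor` consecutive sequences into one longer sequence."""
    return [sum(batch[i:i + factor], []) for i in range(0, len(batch), factor)]


def unfold(folded, length):
    """Split each folded sequence back into chunks of the original length."""
    return [seq[i:i + length] for seq in folded for i in range(0, len(seq), length)]


# Four short sequences of length 3 (hypothetical data).
batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
length = len(batch[0])

reference = [scan(seq) for seq in batch]   # one scan per sequence
folded = fold(batch, factor=2)             # two sequences of length 6
folded_out = unfold([scan(seq, reset_every=length) for seq in folded], length)

assert folded_out == reference
```

Folding trades a larger batch dimension for a longer sequence dimension. Whether that trade speeds up a real GPU scan depends on the kernel, which this sketch does not model.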
Findings
Outperforms state-of-the-art baselines in image classification, detection, and segmentation.
Achieves higher throughput across various model sizes.
Demonstrates significant efficiency and accuracy improvements.
Abstract
Mamba-based vision models have advanced in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba is relatively slow at the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding…
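The causality limit described in the abstract, and the general idea of relaying information through an auxiliary token, can be illustrated with a toy example. Everything in this sketch is an assumption rather than the paper's design: a running mean stands in for the Mamba scan, one auxiliary token is appended at the end, moved to the front between two layers, and discarded at the output. The names `causal_mean_scan` and `two_layer` are hypothetical.

```python
def causal_mean_scan(tokens):
    """Toy causal mixer: output t is the mean of tokens 0..t (no access to later tokens)."""
    out, total = [], 0.0
    for t, x in enumerate(tokens, start=1):
        total += x
        out.append(total / t)
    return out


def two_layer(patches, use_swap):
    """Two stacked causal scans, optionally with an auxiliary token moved between them."""
    if not use_swap:
        return causal_mean_scan(causal_mean_scan(patches))
    # Assumption: the auxiliary token starts at 0.0 and sits after the last patch,
    # so after the first scan it summarizes every patch.
    hidden = causal_mean_scan(patches + [0.0])
    # Move the auxiliary token to the front so the next causal scan
    # exposes its summary to every patch.
    hidden = [hidden[-1]] + hidden[:-1]
    out = causal_mean_scan(hidden)
    return out[1:]  # discard the auxiliary token


a = [1.0, 2.0, 3.0, 6.0]
b = [1.0, 2.0, 3.0, 60.0]  # only the last patch differs

# Without the swap, the first patch never sees the last one.
assert two_layer(a, use_swap=False)[0] == two_layer(b, use_swap=False)[0]
# With the swap, a change in the last patch reaches the first patch.
assert two_layer(a, use_swap=True)[0] != two_layer(b, use_swap=True)[0]
```

The auxiliary token carries only a summary, so this relay is weaker than a full reverse scan. One of the reviews below makes the same point, noting that swapping is not equivalent to a bidirectional scan.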
Peer Reviews
Decision: Submitted to ICLR 2026
+ The manuscript is well-presented and easy to follow. + It is good to see an analysis of the inference speed of the Mamba-based model. + The reset trick is interesting.
+ The proposed method is only tested on MambaVision. It should be applied to more Mamba-based models to support the claim of "Rethinking State Space Model for Vision". + Swapping the last token is not equivalent to a bidirectional scan, and the authors failed to prove the superiority of swapping the last token. As shown in Table 3, if the attention is removed, swapping the last token performs worse than parallel bi-scan and even series bi-scan. Since attention itself is a non-directional operation, this seems that switching t
1. This paper targets critical pain points of visual Mamba (causality constraint, short-sequence inefficiency) with lightweight, non-intrusive solutions. 2. Comprehensive validation across three fundamental vision tasks (classification, detection, segmentation) with consistent performance gains. 3. Practical optimizations (e.g., adaptive $B_1$, Triton kernel for swapping) enhance real-world applicability, with code release planned.
1. The macro-architecture heavily relies on MambaVision’s hybrid (Mamba+Attention) design, lacking significant innovations in overall network structure. 2. Ablation studies on auxiliary token initialization and discard timing are limited; deeper analysis of their impact on different tasks is needed. 3. No discussion of generalization to ultra-high-resolution images or low-resource devices (e.g., edge GPUs), which limits insight into the method's scope.
1. The motivation is good. Previous methods addressed the problem from the perspective of multiple scans, whereas this paper innovatively addresses the Mamba architecture's issue of sequential reasoning by focusing on a single-scan approach for causal information swapping. 2. By restructuring the tensor dimensions of batch data, the paper improves parallelism from a GPU computation perspective, which benefits the acceleration of large-scale training. 3. The experiments are comprehensive, cover
1. The performance improvement of the model is not significant. The accuracy improvement over the baseline MambaVision is not obvious. Moreover, the gain in speed is not so significant. 2. The comparison models do not include Fast R-CNN or Faster R-CNN for the object detection experiment, and the comparison on the validation set is insufficient. 3. The method of exchanging token positions to achieve contextual structure interaction may not be the optimal approach. The current ablation
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
