V2M: Visual 2-Dimensional Mamba for Image Representation Learning
Chengkun Wang, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu

TL;DR
V2M introduces a novel 2D spatially-aware Mamba model for image representation learning, directly processing image tokens in 2D to better preserve local structure and improve performance on various visual tasks.
Contribution
The paper proposes a new 2D extension of the Mamba model using 2D state space modeling, enabling efficient and locality-aware image processing.
Findings
Outperforms other visual backbones on ImageNet classification.
Effective in downstream tasks like object detection and segmentation.
Maintains hardware efficiency and scalability.
Abstract
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2-Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2-dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on…
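To make the abstract's description of a 2D state update concrete, here is a minimal sketch under stated assumptions. It assumes a plain linear recurrence in which the state at grid position (i, j) depends on the state above it and the state to its left. The function name `ssm_2d_scan` and the parameters `A_row`, `A_col`, `B`, `C` are hypothetical and do not come from the paper. The sketch is time-invariant and uses a naive double loop, so it does not reproduce V2M's Mamba-style input-dependent parameters or its hardware-efficient implementation.

```python
import numpy as np

def ssm_2d_scan(x, A_row, A_col, B, C):
    """Naive 2D linear recurrence over an (H, W, D_in) token grid.

    Assumed update (illustrative, not the paper's exact formulation):
        h[i, j] = A_row @ h[i-1, j] + A_col @ h[i, j-1] + B @ x[i, j]
        y[i, j] = C @ h[i, j]
    States outside the grid are treated as zero.
    """
    H, W, _ = x.shape
    N = A_row.shape[0]
    # Pad with a zero row and zero column so the borders need no special case.
    h = np.zeros((H + 1, W + 1, N))
    y = np.zeros((H, W, C.shape[0]))
    for i in range(H):
        for j in range(W):
            above = h[i, j + 1]   # state at grid position (i-1, j)
            left = h[i + 1, j]    # state at grid position (i, j-1)
            h[i + 1, j + 1] = A_row @ above + A_col @ left + B @ x[i, j]
            y[i, j] = C @ h[i + 1, j + 1]
    return y

# Usage on a random 4x4 grid of 3-dimensional tokens with a state of size 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))
A_row = 0.5 * np.eye(8)
A_col = 0.5 * np.eye(8)
B = rng.standard_normal((8, 3))
C = rng.standard_normal((2, 8))
print(ssm_2d_scan(x, A_row, A_col, B, C).shape)  # (4, 4, 2)
```

In this sketch each output token aggregates information from every token above and to the left of it, without first flattening the grid into a 1D sequence.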
Peer Reviews
Decision · Submitted to ICLR 2025
- Developing high-performance, high-dimensional state-space models is an important topic, and the paper is an interesting attempt in this direction.
- The results in the paper are generally good, demonstrating strong performance on ImageNet classification, COCO detection, and segmentation compared to strong baselines such as ViM, LocalMamba, and VMamba.
- One of the essential advantages of Mamba is its high efficiency in processing long sequences. The paper does not show V2M's performance on long sequences.
- V2M's runtime is not reported. Given that hardware-efficient implementation is an important part of V2M, comparing its runtime with 1D Mamba is crucial.
- Considering 2D spatial relationships in the visual domain is intuitive and sounds reasonable.
- Unlike other hierarchical vision Mamba methods, this paper handles visual perception with a plain, non-hierarchical architecture, retaining its applicability to multimodal settings.
- Compared with the baseline method, this paper presents significant improvements in both classification and dense prediction tasks.
- This paper addresses a good question. Scanning strategy with SSM
- The character corners in Figure 1 require further explanation.
- Note that function names in formulas are usually typeset properly in Roman font, e.g., `rot`, `concat`, `SSM`, `Linear`, `sum` in Eq. 13~15.
- The paper references preprints and arXiv versions of significant works, such as Mamba (COLM), Vision Mamba (ICML), and VMamba (NeurIPS). The authors should update these citations to their final published versions to reflect the current state of the literature.
This paper innovatively introduces a 2D state space to address the spatial coherence issues faced by SSM models in the visual domain, achieving promising results.
1. Novelty: The novelty is moderate, as there has already been substantial research, such as VMamba [1], exploring how sequence models can preserve 2D structural information in visual tasks.
2. Implementation: While this paper introduces a theory of 2D state space to address coherence issues in visual tasks, its implementation relies on excessive simplification, lacking a detailed explanation of the rationale behind this approach and its potential implications.
3. Performance: The performance dem
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need · Focus · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
