A Survey on Mamba Architecture for Vision Applications

Fady Ibrahim; Guangjun Liu; Guanghui Wang

arXiv:2502.07161 · cs.CV · February 12, 2025 · 3 cites

Open Access

TL;DR

This paper surveys the Mamba architecture, a scalable and efficient alternative to Transformers that uses state-space models (SSMs) for visual tasks such as object detection and video understanding.

Contribution

It provides a comprehensive overview of recent Mamba architecture developments, including Vision Mamba and VideoMamba, highlighting innovations for improved visual processing.

Findings

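01. The Mamba architecture scales linearly with sequence length on vision tasks, in contrast to the quadratic complexity of attention.

02. Vision Mamba (ViM) and VideoMamba introduce bidirectional scanning, selective scanning, and spatiotemporal processing to improve image and video understanding.

03. Position embeddings, cross-scan modules, and hierarchical designs improve global and local feature extraction.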

Abstract

Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a…
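To make the abstract's notions of a selective state-space scan and bidirectional scanning concrete, the following is a minimal NumPy sketch, not code from the survey or from any Mamba library. All names (`init_params`, `selective_scan`, `bidirectional_scan`), the parameter shapes, the softplus step size, and the simplified discretization (`exp(dt * A)` for the state matrix, `dt * B` for the input matrix) are illustrative assumptions in the spirit of the general selective-SSM formulation. The loop visits each token once, so the cost grows linearly with sequence length.

```python
import numpy as np


def init_params(D, N, rng):
    """Hypothetical random parameters for D channels and state size N."""
    A = -np.exp(rng.standard_normal((D, N)))   # negative entries keep the state stable
    W_B = rng.standard_normal((D, N)) * 0.1    # makes B depend on the input token
    W_C = rng.standard_normal((D, N)) * 0.1    # makes C depend on the input token
    W_dt = rng.standard_normal((D, D)) * 0.1   # makes the step size depend on the input
    return A, W_B, W_C, W_dt


def selective_scan(x, A, W_B, W_C, W_dt):
    """Sequential selective scan over tokens x of shape (L, D); O(L) in length."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                       # hidden state, one row per channel
    y = np.empty_like(x)
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ W_dt))     # softplus step size, shape (D,)
        B = x[t] @ W_B                         # input-dependent B, shape (N,)
        C = x[t] @ W_C                         # input-dependent C, shape (N,)
        A_bar = np.exp(dt[:, None] * A)        # discretized state matrix, (D, N)
        B_bar = dt[:, None] * B[None, :]       # simplified discretized input matrix
        h = A_bar * h + B_bar * x[t][:, None]  # recurrent state update
        y[t] = h @ C                           # read out one value per channel
    return y


def bidirectional_scan(x, params_fwd, params_bwd):
    """Sum a forward scan and a scan over the reversed sequence (assumed fusion)."""
    fwd = selective_scan(x, *params_fwd)
    bwd = selective_scan(x[::-1], *params_bwd)[::-1]
    return fwd + bwd


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))          # e.g. 16 flattened image patches
out = bidirectional_scan(tokens, init_params(8, 4, rng), init_params(8, 4, rng))
print(out.shape)  # (16, 8)
```

In this sketch the forward scan alone is causal: the output at a position depends only on earlier tokens. Adding the reversed scan lets every position see the whole sequence, which is the motivation for bidirectional scanning on image patches, where there is no natural left-to-right order. Summing the two directions is only one possible way to fuse them and is chosen here for brevity.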


Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications

Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces