MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Yuhuan Yang; Chaofan Ma; Zhenjie Mao; Jiangchao Yao; Ya Zhang; Yanfeng Wang

arXiv:2506.23283 · cs.CV · July 1, 2025

Open Access

TL;DR

MoMa is an adapter framework that adapts image foundation models to video recognition by modeling spatial and temporal dynamics jointly rather than separately. It is reported to do so efficiently and to outperform prior methods on multiple benchmarks.

Contribution

MoMa integrates Mamba's selective state space modeling into image foundation models (IFMs) through the SeqMod operation, which injects spatial-temporal information without disrupting the pre-trained features and keeps the added computational overhead low.

Findings

01. Achieves superior accuracy on video benchmarks.

02. Maintains computational efficiency compared to existing methods.

03. Effectively captures complex video dynamics.

Abstract

Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency.…
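The abstract does not spell out how SeqMod works. As a rough illustration only, one generic way to inject side information "without disrupting" pre-trained features is an element-wise scale-and-shift modulation that reduces to the identity when the modulation signals are zero. The sketch below assumes that form; the function names `modulate` and `zeros_like`, the `(1 + scale)` parameterization, and the tokens-by-channels layout are all hypothetical and are not taken from the paper.

```python
from typing import List

# A sequence of token features: one row per token, one column per channel.
Sequence = List[List[float]]


def zeros_like(features: Sequence) -> Sequence:
    """Return an all-zero sequence with the same shape as `features`."""
    return [[0.0] * len(row) for row in features]


def modulate(features: Sequence, scale: Sequence, shift: Sequence) -> Sequence:
    """Hypothetical per-token modulation: out = features * (1 + scale) + shift.

    This is an assumed form, not the paper's definition of SeqMod.
    With scale == 0 and shift == 0 the pre-trained features pass
    through unchanged, which is one way to avoid disrupting them.
    """
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("features, scale and shift must have the same length")
    return [
        [f * (1.0 + a) + b for f, a, b in zip(f_row, a_row, b_row)]
        for f_row, a_row, b_row in zip(features, scale, shift)
    ]


# Toy "frozen IFM" features for two tokens with two channels each.
ifm_features = [[1.0, 2.0], [3.0, 4.0]]

# Zero modulation leaves the features untouched.
unchanged = modulate(ifm_features, zeros_like(ifm_features), zeros_like(ifm_features))

# Non-zero signals (standing in for whatever a temporal branch would produce).
scale = [[0.5, 0.5], [0.0, 0.0]]
shift = [[1.0, 1.0], [0.0, 0.0]]
modulated = modulate(ifm_features, scale, shift)
```

In a design like this, the modulation signals would typically come from a trainable branch while the foundation model stays frozen; starting that branch at zero means training begins from the original model's behavior.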

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging