MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition
Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR
MoMa is an adapter framework that adapts image foundation models to video recognition by modeling spatial and temporal dynamics jointly rather than separately, while staying computationally efficient. The summary reports that it outperforms prior methods on multiple benchmarks.
Contribution
MoMa integrates Mamba's selective state space modeling into image foundation models (IFMs) through a SeqMod operation, enabling full spatial-temporal modeling with minimal computational overhead.
Findings
Achieves superior accuracy on video benchmarks.
Maintains computational efficiency compared to existing methods.
Effectively captures complex video dynamics.
Abstract
Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency.…
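The idea of injecting sequence-level information into frozen features "without disrupting their original features" can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the authors' SeqMod or Divide-and-Modulate implementation: the function names (`linear_scan`, `modulate`), the scale-and-shift form of the modulation, the zero-initialized projections, and the fixed-decay linear recurrence used as a simple stand-in for Mamba's selective state space layer.

```python
import numpy as np

def linear_scan(u, decay=0.9):
    """Causal recurrence h_t = decay * h_{t-1} + (1 - decay) * u_t.

    Hypothetical stand-in for a selective state space layer: real Mamba
    makes its coefficients input-dependent, this one keeps them fixed.
    """
    h = np.zeros_like(u)
    state = np.zeros(u.shape[-1], dtype=u.dtype)
    for t in range(u.shape[0]):
        state = decay * state + (1.0 - decay) * u[t]
        h[t] = state
    return h

def modulate(x, w_scale, w_shift):
    """Modulate frozen per-frame features with sequence context.

    x: array of shape (T, N, C) = (frames, tokens per frame, channels).
    w_scale, w_shift: (C, C) projections (assumed, illustrative).
    """
    T, N, C = x.shape
    seq = x.reshape(T * N, C)            # one sequence over space and time
    ctx = linear_scan(seq)               # context mixed across all tokens
    scale = ctx @ w_scale                # per-token, per-channel scale
    shift = ctx @ w_shift                # per-token, per-channel shift
    out = seq * (1.0 + scale) + shift    # identity when both projections are zero
    return out.reshape(T, N, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8))       # 4 frames, 6 tokens, 8 channels
C = x.shape[-1]
zero = np.zeros((C, C))

same = modulate(x, zero, zero)           # zero-initialized adapter
changed = modulate(x, 0.1 * rng.standard_normal((C, C)), zero)
```

With zero-initialized projections the output equals the input, so the pre-trained features are untouched at the start of fine-tuning. Once the projections become non-zero, each token is rescaled and shifted by context gathered from earlier tokens across both space and time.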
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
