M4V: Multi-Modal Mamba for Text-to-Video Generation

Jiancheng Huang; Gengwei Zhang; Zequn Jie; Siyu Jiao; Yinlong Qian; Ling Chen; Yunchao Wei; Lin Ma

arXiv:2506.10915 · cs.CV · June 13, 2025

TL;DR

M4V introduces a multi-modal, Mamba-based framework for text-to-video generation that lowers computational cost relative to Transformer-based designs and improves visual quality through multi-modal token re-composition and a reward learning strategy.

Contribution

The paper presents M4V, a multi-modal Mamba architecture that integrates multi-modal information with spatiotemporal modeling. It reduces FLOPs by 45% at 768×1280 resolution and improves video quality in text-to-video synthesis.

Findings

01. Reduces FLOPs by 45% at 768×1280 resolution.

02. Achieves high-quality video generation at lower computational cost.

03. Enhances visual realism using a reward learning strategy.

Abstract

Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token…
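The complexity contrast drawn in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's FLOP accounting: the two cost functions, the token counts, and the `dim` and `state_size` values are hypothetical, chosen only to show how a quadratic term and a linear term grow as the token sequence gets longer.

```python
# Illustrative only: simplified multiply-add counts, not measurements from M4V.

def attention_pairwise_cost(num_tokens: int, dim: int) -> int:
    """Approximate cost of forming all query-key scores: n * n * d."""
    return num_tokens * num_tokens * dim


def linear_scan_cost(num_tokens: int, dim: int, state_size: int) -> int:
    """Approximate cost of one recurrent state update per token: n * d * s."""
    return num_tokens * dim * state_size


if __name__ == "__main__":
    dim, state_size = 64, 16  # hypothetical channel and state sizes
    for num_tokens in (1024, 2048, 4096):  # hypothetical sequence lengths
        quadratic = attention_pairwise_cost(num_tokens, dim)
        linear = linear_scan_cost(num_tokens, dim, state_size)
        print(f"tokens={num_tokens:5d}  quadratic={quadratic:>14,d}  linear={linear:>12,d}")
```

Under these assumptions, doubling the number of tokens quadruples the quadratic term but only doubles the linear one, which is why long spatiotemporal token sequences are where a linear-time model is expected to help most.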

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications

Methods: Diffusion · Mamba: Linear-Time Sequence Modeling with Selective State Spaces