M4V: Multi-Modal Mamba for Text-to-Video Generation
Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma

TL;DR
M4V introduces an efficient multi-modal Mamba-based framework for text-to-video generation that reduces computational cost and improves visual quality through multi-modal token re-composition and a reward learning strategy.
Contribution
The paper presents M4V, a multi-modal Mamba architecture that integrates multi-modal information with spatiotemporal modeling, reducing FLOPs by 45% at 768×1280 resolution and improving video quality in text-to-video synthesis.
Findings
Reduces FLOPs by 45% relative to an attention-based alternative when generating videos at 768×1280 resolution.
Achieves high-quality video generation with lower computational costs.
Enhances visual realism using a reward learning strategy.
Abstract
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token…
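The complexity argument in the abstract can be illustrated with a minimal sketch. It is not the M4V implementation: the token counts, the function names, and the fixed `decay` and `gain` values are assumptions chosen only for illustration. The sketch counts the token pairs that full self-attention scores (quadratic in sequence length) against the state updates of a toy scalar recurrence (linear in sequence length).

```python
def attention_pair_count(num_tokens: int) -> int:
    """Token pairs scored by full self-attention: grows as n^2."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """State updates performed by a recurrent scan: grows as n."""
    return num_tokens


def linear_scan(inputs, decay=0.9, gain=0.1):
    """Toy scalar state-space recurrence h_t = decay * h_{t-1} + gain * x_t.

    decay and gain are fixed constants here (an assumption for brevity);
    Mamba instead makes such parameters depend on the input (selective).
    """
    state = 0.0
    outputs = []
    for x in inputs:
        # One constant-cost update per token, regardless of sequence length.
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    # Hypothetical sequence lengths, not taken from the paper.
    for n in (1_000, 10_000, 100_000):
        print(n, attention_pair_count(n), scan_step_count(n))
```

Growing the sequence by 10× multiplies the attention pair count by 100× but the scan step count by only 10×, which is why long spatiotemporal token sequences favor linear-time models.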
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications
Methods: Diffusion · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
