Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, Zhenbang Sun

TL;DR
Mamoda2.5 is a unified multimodal AR-Diffusion model whose Diffusion Transformer backbone uses a Mixture-of-Experts design (25B parameters, 3B active), targeting high quality and efficiency in video editing and content moderation tasks.
Contribution
It introduces a scalable MoE-based Diffusion Transformer architecture (128 experts, Top-8 routing) and a joint few-step distillation and reinforcement learning framework for fast, high-quality video editing.
Findings
Achieves top-tier performance on VBench 2.0 and OpenVE-Bench.
Surpasses open-source models and matches top proprietary models in video editing quality.
Accelerates inference by up to 95.9 times compared to baselines.
Abstract
We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including Kling O1, on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and…
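The abstract's "128 experts, Top-8 routing" can be illustrated with a minimal sketch of top-k expert routing for a single token. This is not the authors' implementation: the function names (`top_k_route`, `moe_forward`), the scalar toy experts, and the choice to apply softmax over only the selected logits are all assumptions made for illustration. Only the expert count, the value of k, and the 25B/3B parameter figures come from the abstract.

```python
import math
import random

NUM_EXPERTS = 128  # from the abstract
TOP_K = 8          # from the abstract


def top_k_route(logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Assumption: gate weights are a softmax over the selected logits only;
    the paper may normalize differently.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=TOP_K):
    """Weighted sum of the outputs of the k selected experts."""
    indices, weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in zip(indices, weights))


# Hypothetical toy setup: each "expert" just scales a scalar input.
rng = random.Random(0)
scales = [rng.uniform(-1.0, 1.0) for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=s: s * x for s in scales]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

indices, weights = top_k_route(logits)
y = moe_forward(2.0, logits, experts)

expert_fraction = TOP_K / NUM_EXPERTS  # share of experts run per token
param_fraction = 3 / 25                # active / total parameters (3B of 25B)
```

Per token, only 8 of 128 experts run (6.25%), while the abstract reports 3B of 25B parameters active (12%). One plausible reason for the gap is that attention and other non-expert layers are always active, but the abstract does not give that breakdown.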
