MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Yongshun Zhang; Zhongyi Fan; Yonghang Zhang; Zhangzikang Li; Weifeng Chen; Zhongwei Feng; Chaoyue Wang; Peng Hou; Anxiang Zeng

arXiv:2510.17519 · cs.CV · October 23, 2025

TL;DR

This paper introduces a high-efficiency training pipeline for large-scale video generation models and the resulting model, MUG-V 10B, which achieves state-of-the-art performance and is openly available for research and development.

Contribution

The paper presents a training framework that jointly optimizes data processing, model architecture, training strategy, and infrastructure, enabling efficient training and strong performance for large video generation models.

Findings

01 MUG-V 10B matches state-of-the-art video generators.

02 It surpasses open-source baselines on e-commerce video tasks.

03 It achieves high training efficiency with near-linear multi-node scaling.

Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent…

Code & Models

Models

🤗 MUG-V/MUG-V-inference (model · 13 downloads · 7 likes)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis