Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task

Yiran Yang; Jinchao Zhang; Ying Deng; Jie Zhou

arXiv:2407.06617·cs.CV·July 24, 2024

TL;DR

Mobius introduces a spatial-temporal parallel training paradigm for text-to-video (T2V) generation that reduces GPU memory usage by 24% and training time by 12% compared with traditional serial models, making T2V fine-tuning more efficient and less resource-intensive.

Contribution

The paper presents a new parallel training framework for T2V that optimizes feature flow, saving GPU memory and training time, and offers a fresh approach for efficient video generation.

Findings

01. Reduces GPU memory usage by 24%.
02. Cuts training time by 12%.
03. Improves efficiency of T2V fine-tuning.

Abstract

Inspired by the success of the text-to-image (T2I) generation task, many researchers are devoting themselves to the text-to-video (T2V) generation task. Most T2V frameworks inherit from a T2I model and add extra temporal layers that are trained to generate dynamic videos, which can be viewed as a fine-tuning task. However, the traditional 3D-Unet operates in a serial mode in which the temporal layers follow the spatial layers, and this serial feature flow results in high GPU memory and training time consumption. We believe that this serial mode will bring more training costs with large diffusion models and massive datasets, which is neither environmentally friendly nor suitable for the development of T2V. Therefore, we propose a highly efficient spatial-temporal parallel training paradigm for T2V tasks, named Mobius. In our 3D-Unet, the temporal layers and…
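
The difference between a serial and a parallel feature flow can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract above is truncated, so the way the two branches are merged (element-wise addition here) is an assumption, and `spatial_layer`, `temporal_layer`, `serial_block`, and `parallel_block` are hypothetical stand-ins written in plain Python rather than real 3D-Unet layers.

```python
# Toy "video": a list of frames, each frame a list of pixel values.
Video = list[list[float]]


def spatial_layer(x: Video) -> Video:
    """Stand-in spatial op: mixes values within each frame (subtracts the frame mean)."""
    out = []
    for frame in x:
        mean = sum(frame) / len(frame)
        out.append([v - mean for v in frame])
    return out


def temporal_layer(x: Video) -> Video:
    """Stand-in temporal op: mixes values across frames (subtracts each pixel's mean over time)."""
    n_frames, n_pixels = len(x), len(x[0])
    means = [sum(x[t][p] for t in range(n_frames)) / n_frames for p in range(n_pixels)]
    return [[x[t][p] - means[p] for p in range(n_pixels)] for t in range(n_frames)]


def serial_block(x: Video) -> Video:
    # Serial mode: the temporal layer follows the spatial layer,
    # so it can only start once the spatial output exists.
    return temporal_layer(spatial_layer(x))


def parallel_block(x: Video) -> Video:
    # Parallel mode (assumed): both branches read the same input.
    # Merging by element-wise addition is an assumption of this sketch.
    s = spatial_layer(x)
    t = temporal_layer(x)
    return [[a + b for a, b in zip(fs, ft)] for fs, ft in zip(s, t)]


video = [[1.0, 3.0], [5.0, 7.0]]  # 2 frames, 2 pixels each
print("serial:  ", serial_block(video))
print("parallel:", parallel_block(video))
```

In the serial block the temporal branch depends on the spatial output, while in the parallel block neither branch waits on the other. The sketch shows only this structural difference and makes no claim about the memory or time savings reported for Mobius.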

Code & Models

Repositories

youngfly/Mobius (PyTorch, official)

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization · Handwritten Text Recognition Techniques

Methods: Diffusion