Magic 1-For-1: Generating One Minute Video Clips within One Minute
Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou

TL;DR
Magic 1-For-1 introduces an efficient method for generating one-minute video clips in under a minute by decomposing the task into simpler steps and applying various optimization techniques to reduce computational costs.
Contribution
The paper proposes a novel two-step diffusion-based approach for fast text-to-video generation, optimizing memory and inference speed, and demonstrating high-quality, long-duration video synthesis.
Findings
Generated 5-second videos in 3 seconds.
Produced one-minute videos within one minute.
Achieved improved visual quality and motion dynamics.
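The two timing findings can be related with a rough back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that a one-minute video is assembled from consecutive 5-second clips generated at the reported 3 seconds each, with no overhead between clips; the summary does not state how long videos are actually assembled, and the function name is hypothetical.

```python
import math

CLIP_SECONDS = 5.0       # length of one generated clip (from the findings)
SECONDS_PER_CLIP = 3.0   # reported generation time for one such clip


def estimated_generation_time(target_seconds: float) -> float:
    """Estimate wall-clock time, assuming clips are generated back to back.

    Assumption (not from the source): cost scales linearly with the
    number of 5-second clips and there is no per-clip overhead.
    """
    num_clips = math.ceil(target_seconds / CLIP_SECONDS)
    return num_clips * SECONDS_PER_CLIP


print(estimated_generation_time(60.0))  # 12 clips * 3 s = 36.0
```

Under these assumptions, 12 clips take about 36 seconds, which is consistent with the "one minute within one minute" claim and leaves some headroom for overhead.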
Abstract
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, with the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to…
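The factorization described in the abstract can be pictured as two chained stages. The following is a minimal structural sketch, not the authors' implementation: the class, the function names, and the toy stand-in models are all hypothetical, and passing the prompt to the second stage is an assumption suggested by the phrase "multi-modal prior condition injection".

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder image type: a 2D grid of pixel intensities in [0, 1].
Image = List[List[float]]


@dataclass
class TwoStagePipeline:
    """Hypothetical wrapper that chains a T2I stage into an I2V stage."""
    text_to_image: Callable[[str], Image]
    image_to_video: Callable[[Image, str, int], List[Image]]

    def generate(self, prompt: str, num_frames: int) -> List[Image]:
        # Stage 1: text-to-image produces the conditioning frame.
        first_frame = self.text_to_image(prompt)
        # Stage 2: image-to-video animates that frame; the prompt is
        # passed again as an additional condition (an assumption).
        return self.image_to_video(first_frame, prompt, num_frames)


def toy_text_to_image(prompt: str) -> Image:
    # Toy stand-in for a real T2I model: a flat 4x4 image whose
    # brightness depends on the prompt length.
    value = (len(prompt) % 10) / 10.0
    return [[value] * 4 for _ in range(4)]


def toy_image_to_video(image: Image, prompt: str, num_frames: int) -> List[Image]:
    # Toy stand-in for a real I2V model: brighten the image slightly
    # on each frame, clamped to 1.0. The prompt is unused here.
    return [
        [[min(1.0, p + 0.01 * t) for p in row] for row in image]
        for t in range(num_frames)
    ]


pipeline = TwoStagePipeline(toy_text_to_image, toy_image_to_video)
frames = pipeline.generate("a cat", num_frames=8)
```

Keeping the stages behind separate callables mirrors the abstract's point that the two subtasks can be trained and distilled independently.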
Videos
NVIDIA’s New AI: The Age of Real Time Game Making Is Here! · youtube
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Video Coding and Compression Technologies
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
