Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang,, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan,, Tien-Tsin Wong

TL;DR
Make-Your-Video is a novel method for customized video generation that combines text descriptions and structural guidance using a latent diffusion model, enhancing control, coherence, and fidelity in generated videos.
Contribution
The paper introduces a joint-conditional video generation approach with a two-stage training scheme and a causal attention mask strategy, improving quality and efficiency over existing methods.
Findings
Outperforms baselines in temporal coherence and fidelity
Enables longer video synthesis with quality preservation
Demonstrates practical applications of customized video generation
Abstract
Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Motion and Animation
MethodsDiffusion · Latent Diffusion Model
