Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Jinbo Xing; Menghan Xia; Yuxin Liu; Yuechen Zhang; Yong Zhang; Yingqing He; Hanyuan Liu; Haoxin Chen; Xiaodong Cun; Xintao Wang; Ying Shan; Tien-Tsin Wong

arXiv:2306.00943 · cs.CV · June 2, 2023 · 2 cites

Open Access

TL;DR

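Make-Your-Video generates customized videos with a latent diffusion model conditioned jointly on a text description, which sets the scene context, and structural guidance such as frame-wise depth, which fixes the motion structure. The aim is finer control, better temporal coherence, and higher fidelity than text prompts alone provide.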

Contribution

The paper introduces a joint-conditional video generation approach with a two-stage training scheme and a causal attention mask strategy, improving quality and efficiency over existing methods.
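The summary above names a causal attention mask but does not describe how the paper applies it. As a minimal sketch of the general idea only, the pure-Python example below builds a lower-triangular mask so that each frame attends to itself and earlier frames, then uses it in scaled dot-product attention along the frame axis. Applying the mask along the temporal axis, and the function names `causal_mask`, `masked_softmax`, and `temporal_attention`, are assumptions made for illustration, not the authors' implementation.

```python
import math


def causal_mask(num_frames):
    """Lower-triangular boolean mask: frame i may attend to frames 0..i."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores, mask):
    """Row-wise softmax; masked-out positions receive exactly zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        visible = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def temporal_attention(queries, keys, values, mask):
    """Scaled dot-product attention over frames for one spatial location.

    Each argument is a list with one feature vector per frame.
    Hypothetical helper for illustration, not the paper's code.
    """
    dim = len(queries[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(dim)
               for kj in keys] for qi in queries]
    weights = masked_softmax(scores, mask)
    out = [[sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))] for row in weights]
    return out, weights


# Three frames, each with a 2-dimensional feature vector (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = causal_mask(3)
output, weights = temporal_attention(features, features, features, mask)
```

With this mask the first frame can only attend to itself, so its output equals its own value vector, and no frame places weight on a later frame. Because a frame's output never depends on frames that come after it, the same mask pattern extends to longer sequences without changing what earlier frames see.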

Findings

01. Outperforms baselines in temporal coherence and fidelity

02. Enables longer video synthesis with quality preservation

03. Demonstrates practical applications of customized video generation

Abstract

Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Motion and Animation

Methods: Diffusion · Latent Diffusion Model