VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang; Tianxing Wu; Shuai Yang; Chenyang Si; Dahua Lin; Yu Qiao; Chen Change Loy; Ziwei Liu

arXiv:2312.00777 · cs.CV · December 4, 2023 · 1 citation

Open Access

TL;DR

VideoBooth introduces a novel diffusion-based framework for video generation guided by image prompts, enabling high-quality, customizable videos with improved appearance fidelity and temporal consistency.

Contribution

The paper presents a new feed-forward approach that embeds image prompts at multiple scales and integrates them into the attention mechanism for enhanced video generation.

Findings

01 Achieves state-of-the-art results in customized video quality.

02 Maintains temporal consistency across frames.

03 Generalizes well to diverse image prompts.

Abstract

Text-driven video generation has witnessed rapid progress. However, text prompts alone are not enough to depict a desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control than text prompts. Specifically, we propose a feed-forward framework, VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from an image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encodings of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level,…
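The abstract is cut off before it describes how the attention injection module works, so the following is only a minimal sketch of one common way to feed image-prompt features into an attention layer: appending the prompt tokens as extra keys and values that the frame queries can attend to. It is not the authors' implementation. The function name `attention_with_image_prompt`, the single-head layout, the token counts, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_image_prompt(frame_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head attention with image-prompt tokens as extra keys/values.

    Hypothetical illustration, not the paper's module.
    frame_tokens:  (n, d) latent tokens of the frame being generated
    prompt_tokens: (m, d) tokens derived from the image prompt
    w_q, w_k, w_v: (d, d) projection matrices
    """
    # Queries come only from the frame tokens.
    q = frame_tokens @ w_q
    # Keys and values come from the frame tokens plus the appended prompt tokens.
    context = np.concatenate([frame_tokens, prompt_tokens], axis=0)
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product attention over the extended context.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # assumed feature dimension
frame_tokens = rng.normal(size=(4, d))   # 4 frame tokens (assumed)
prompt_tokens = rng.normal(size=(3, d))  # 3 image-prompt tokens (assumed)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = attention_with_image_prompt(
    frame_tokens, prompt_tokens, w_q, w_k, w_v
)
print(out.shape, weights.shape)  # (4, 8) (4, 7)
```

Each frame token keeps its own output slot, but its attention weights now span seven positions (four frame tokens plus three prompt tokens), which is how appearance detail from the image prompt could reach the generated frame in a scheme like this. Repeating the same step at layers of different resolution would give a multi-scale variant, though the abstract as shown here does not confirm those details.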

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization