DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

Cong Wang; Jiaxi Gu; Panwen Hu; Songcen Xu; Hang Xu; Xiaodan Liang

arXiv:2312.03018 · cs.CV · September 17, 2024 · 1 citation

Open Access

TL;DR

DreamVideo introduces a novel high-fidelity image-to-video generation method that preserves reference image details and allows controllable video synthesis through text prompts, outperforming existing models in fidelity and flexibility.

Contribution

The paper proposes a frame retention branch and double-condition guidance in a pre-trained video diffusion model for improved image retention and controllable video generation.
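This summary does not give the formula for double-condition guidance. As a rough illustration only, the sketch below uses the two-condition classifier-free guidance form known from InstructPix2Pix as a stand-in; the function name, argument names, and scale values are all hypothetical and not taken from the paper.

```python
import numpy as np

def double_condition_guidance(eps_uncond, eps_img, eps_img_txt, s_img, s_txt):
    """Combine three noise predictions into one guided prediction.

    ASSUMPTION: this is the InstructPix2Pix-style two-condition form,
    used here as a stand-in; the paper's exact formula may differ.

    eps_uncond  : prediction with neither image nor text condition
    eps_img     : prediction with the reference image only
    eps_img_txt : prediction with both reference image and text prompt
    s_img, s_txt: guidance scales for the image and text directions
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)    # push toward the reference image
            + s_txt * (eps_img_txt - eps_img))  # push toward the text prompt

# Toy predictions standing in for three denoiser forward passes.
rng = np.random.default_rng(0)
e_uncond = rng.standard_normal((8, 4, 16, 16))
e_img = rng.standard_normal((8, 4, 16, 16))
e_img_txt = rng.standard_normal((8, 4, 16, 16))

guided = double_condition_guidance(e_uncond, e_img, e_img_txt, s_img=1.5, s_txt=7.5)
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and with the text scale set to 0 the prompt has no influence, which is why separate scales allow image retention and text control to be traded off independently.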

Findings

01. Outperforms state-of-the-art in fidelity, especially on UCF101.

02. Demonstrates effective control of generated videos via text prompts.

03. Achieves high temporal consistency and image detail preservation.

Abstract

Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. Nevertheless, these methods often result in either low fidelity or flickering over time due to their limitation to shallow image guidance and poor temporal consistency. To tackle these problems, we propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo. Instead of integrating the reference image into the diffusion process at a semantic level, our DreamVideo perceives the reference image via convolution layers and concatenates the features with the noisy latents as model input. By this means, the details of the reference image can be preserved to…
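The abstract's description of the frame retention idea (convolve the reference image, then concatenate the resulting features with the noisy latents) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the single 3x3 convolution, the tensor shapes, the channel counts, and the choice to broadcast the same features to every frame are all assumptions, and the helper names are hypothetical.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """Naive 3x3 convolution, stride 1, zero padding.

    x: (C_in, H, W), weight: (C_out, C_in, 3, 3), bias: (C_out,)
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w), dtype=x.dtype)
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out + bias[:, None, None]

def build_model_input(noisy_latents, ref_image, weight, bias):
    """Concatenate reference-image features with noisy video latents.

    noisy_latents: (T, C, H, W); ref_image: (C_ref, H, W).
    ASSUMPTION: the same features are attached to every frame.
    """
    feats = conv3x3(ref_image, weight, bias)             # (C_f, H, W)
    t = noisy_latents.shape[0]
    feats_per_frame = np.broadcast_to(feats, (t,) + feats.shape)
    return np.concatenate([noisy_latents, feats_per_frame], axis=1)

rng = np.random.default_rng(0)
noisy = rng.standard_normal((8, 4, 16, 16))   # 8 frames, 4 latent channels (assumed)
ref = rng.standard_normal((4, 16, 16))        # reference image in latent space (assumed)
w = rng.standard_normal((6, 4, 3, 3)) * 0.1   # hypothetical conv weights
b = np.zeros(6)

model_input = build_model_input(noisy, ref, w, b)
print(model_input.shape)  # (8, 10, 16, 16)
```

Concatenating along the channel axis gives the denoiser spatially aligned access to the reference features, which is the contrast the abstract draws with injecting the image only at a semantic level.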

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Diffusion · Convolution