AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Zhen Xing; Qi Dai; Zejia Weng; Zuxuan Wu; Yu-Gang Jiang

arXiv:2406.06465 · cs.CV · June 11, 2024 · 2 cites

Open Access

TL;DR

This paper introduces AID, a novel method that adapts Image2Video diffusion models with instruction-guided control for improved video prediction, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes a new framework combining a dual query transformer and adapters to transfer pretrained Image2Video models for instruction-guided video prediction with minimal training.

Findings

01. Significant improvements in FVD scores on multiple datasets.

02. Outperforms existing state-of-the-art methods in instruction-guided video prediction.

03. Demonstrates effective transfer of video dynamic priors with minimal training.
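
For context on the metric named in the first finding: FVD (Fréchet Video Distance) is conventionally computed as the Fréchet distance between two Gaussians fitted to features of real and generated videos, with the features taken from a pretrained video network. The formula below is the standard definition and is not quoted from this page. The symbols $\mu_r, \Sigma_r$ (mean and covariance of real-video features) and $\mu_g, \Sigma_g$ (the same for generated videos) are notation introduced here for illustration.

```latex
\mathrm{FVD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Lower values mean the generated feature distribution is closer to the real one, so an improvement in FVD is a decrease in the score.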

Abstract

Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability, primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good priors for video dynamics but lack textual control. Hence, transferring Image2Video models to leverage their video dynamic priors while injecting instruction control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text…

Taxonomy

Topics: Image and Video Quality Assessment · Video Analysis and Summarization · Online Learning and Analytics

Methods: Diffusion