Text-driven Video Prediction

Xue Song; Jingjing Chen; Bin Zhu; Yu-Gang Jiang

arXiv:2210.02872 · cs.CV · October 7, 2022 · 1 citation

TL;DR

This paper introduces Text-driven Video Prediction, a new task that generates future video frames from an initial image and descriptive text, emphasizing the causal influence of text on motion and appearance.

Contribution

The paper proposes a novel framework with a Text Inference Module to leverage text for controlling motion in video prediction, addressing the lack of deterministic constraints in existing models.

Findings

01. Outperforms baseline models on the Something-Something V2 dataset

02. Uses text effectively for causal motion inference

03. Produces coherent video sequences based on textual descriptions

Abstract

Current video generation models usually convert appearance and motion signals, received from inputs (e.g., image, text) or latent spaces (e.g., noise vectors), into consecutive frames; the process is stochastic because of the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints on both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and a text caption as inputs, this task aims to synthesize the following frames. Specifically, the appearance and motion components are provided by the image and the caption, respectively. The key to addressing the TVP task lies in fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this…
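
The task setup described in the abstract (first frame plus caption in, subsequent frames out) can be pictured with a minimal Python sketch. Everything named here is an assumption for illustration: `TVPSample`, `predict_static`, the nested-list frame format, and the example caption are hypothetical, and the placeholder predictor simply repeats the first frame. It is not the paper's model and does not use the caption; it only fixes the shape of the inputs and outputs.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical frame format: an H x W grid of grayscale values.
Frame = List[List[float]]


@dataclass
class TVPSample:
    """One TVP input: appearance comes from the first frame, motion from the caption."""
    first_frame: Frame
    caption: str


def predict_static(sample: TVPSample, num_future: int) -> List[Frame]:
    """Placeholder predictor (assumption, not the paper's method).

    Copies the first frame num_future times and ignores the caption.
    A real TVP model would use the caption to decide how the scene moves.
    """
    return [[row[:] for row in sample.first_frame] for _ in range(num_future)]


# Hypothetical example input.
sample = TVPSample(
    first_frame=[[0.0, 1.0], [1.0, 0.0]],
    caption="pushing something from left to right",
)
future = predict_static(sample, num_future=3)

# The full predicted video is the given first frame followed by the synthesized frames.
video = [sample.first_frame] + future
```

A static-copy predictor like this is only a stand-in that marks where a text-conditioned model would plug in: same signature, but with frames that change according to the caption.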

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization