Text-driven Video Prediction
Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang

TL;DR
This paper introduces Text-driven Video Prediction, a new task that generates future video frames from an initial image and descriptive text, emphasizing the causal influence of text on motion and appearance.
Contribution
The paper proposes a novel framework with a Text Inference Module to leverage text for controlling motion in video prediction, addressing the lack of deterministic constraints in existing models.
Findings
Outperforms baseline models on the Something-Something V2 dataset
Effectively utilizes text for causal motion inference
Produces coherent video sequences based on textual descriptions
Abstract
Current video generation models usually convert signals indicating appearance and motion received from inputs (e.g., image, text) or latent spaces (e.g., noise vectors) into consecutive frames, fulfilling a stochastic generation process for the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints for both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and text caption as inputs, this task aims to synthesize the following frames. Specifically, appearance and motion components are provided by the image and caption separately. The key to addressing the TVP task depends on fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this…
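The input/output contract of the TVP task described above can be sketched in a few lines. This is only an illustration of the task interface, not the paper's model: the names `TVPInput` and `predict_copy_baseline` are hypothetical, the array layout `(H, W, 3)` is an assumption, and the predictor is a trivial placeholder that repeats the first frame and ignores the caption entirely.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TVPInput:
    """Hypothetical container for one TVP example (names are assumptions)."""
    first_frame: np.ndarray  # assumed shape (H, W, 3); supplies appearance
    caption: str             # text description; supplies motion


def predict_copy_baseline(inputs: TVPInput, num_future_frames: int) -> np.ndarray:
    """Placeholder predictor, NOT the paper's method.

    Returns `num_future_frames` copies of the first frame, stacked as
    (T, H, W, 3). A real TVP model would use `inputs.caption` to decide
    how the scene moves; this stand-in only fixes the expected shapes.
    """
    frame = inputs.first_frame
    if frame.ndim != 3 or frame.shape[-1] != 3:
        raise ValueError("first_frame must have shape (H, W, 3)")
    if num_future_frames < 1:
        raise ValueError("num_future_frames must be at least 1")
    # Add a leading time axis, then repeat the frame along it.
    return np.repeat(frame[None, ...], num_future_frames, axis=0)


if __name__ == "__main__":
    example = TVPInput(
        first_frame=np.zeros((64, 64, 3), dtype=np.uint8),
        caption="pushing a cup from left to right",
    )
    future = predict_copy_baseline(example, num_future_frames=8)
    print(future.shape)
```

A static copy of the first frame keeps appearance fixed but contains no motion, which is exactly the component the task expects the caption to supply.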
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
