Video Generation From Text
Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, Lawrence Carin

TL;DR
This paper introduces a hybrid VAE-GAN framework for generating videos from text, effectively capturing static and dynamic scene features, and demonstrates superior performance over baseline models.
Contribution
The paper presents a novel hybrid VAE-GAN model for text-to-video generation and a method to automatically create a large text-video dataset from online videos.
Findings
Generated videos are plausible and diverse.
The model outperforms baseline text-to-video methods.
Evaluation shows that the generated videos are higher in quality and reflect the input text more accurately than those of the baselines.
Abstract
Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse videos, while accurately reflecting the input text information. It…
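The idea of "transforming input text into an image filter" can be illustrated with a small sketch. This is not the authors' implementation: the linear projection `weight`, the 3x3 kernel size, the 16-dimensional text vector, and the function names `text_to_filter` and `apply_filter` are all hypothetical choices made for illustration, and random arrays stand in for a learned text embedding and a VAE-produced gist image.

```python
import numpy as np

def text_to_filter(text_vec, weight, kernel_size):
    """Project a text embedding to a 2-D kernel (hypothetical linear map)."""
    return (weight @ text_vec).reshape(kernel_size, kernel_size)

def apply_filter(gist, kernel):
    """Apply the kernel to each channel of a gist image of shape (H, W, C).

    Uses zero padding so the output keeps the input shape. Like most deep
    learning "convolutions", this is a cross-correlation (no kernel flip).
    Assumes an odd kernel size.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(gist, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = gist.shape
    out = np.zeros_like(gist, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i:i + h, j:j + w, :]
    return out

# Stand-in data (assumptions): random text embedding, projection, and gist.
rng = np.random.default_rng(0)
text_vec = rng.normal(size=16)      # placeholder for an encoded sentence
weight = rng.normal(size=(9, 16))   # placeholder for learned parameters
gist = rng.random((8, 8, 3))        # placeholder for a gist image

kernel = text_to_filter(text_vec, weight, 3)
encoded = apply_filter(gist, kernel)  # text-conditioned version of the gist
```

In this sketch the text determines the filter weights, so different sentences produce different responses from the same gist. In a full model the filtered gist would then condition a video generator, and the projection would be trained jointly with it.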
