Video Generation from Text Employing Latent Path Construction for Temporal Modeling
Amir Mazaheri, Mubarak Shah

TL;DR
This paper introduces a novel method for generating realistic videos from natural language descriptions by regressing latent representations and employing a progressive upsampling approach, outperforming existing baselines on complex datasets.
Contribution
It presents what the authors describe as, to the best of their knowledge, the first approach for text-to-video generation from free-form sentences on realistic datasets, using latent path construction and a stacking upPooling block for progressive frame synthesis.
Findings
Outperforms RNN-based and deconvolution-based baselines
Effective in generating videos from complex natural language descriptions
Capable of handling realistic datasets like A2D and UCF101
Abstract
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision. In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation. Humans can listen to or read natural language sentences and imagine or visualize what is being described; therefore, we believe that video generation from natural language sentences will have an important impact on Artificial Intelligence. Video generation is a relatively new field of study in Computer Vision and is far from being solved. The majority of recent works deal with synthetic datasets or real datasets with very limited types of objects, scenes, and emotions. To the best of our knowledge, this is the very first work on text (free-form sentences) to video generation on more realistic video datasets like the Actor and Action Dataset (A2D) or UCF101. We…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition
