Video Generation from Text Employing Latent Path Construction for Temporal Modeling

Amir Mazaheri; Mubarak Shah

arXiv:2107.13766·cs.CV·July 30, 2021·1 cites

Open Access

TL;DR

This paper introduces a novel method for generating realistic videos from natural language descriptions by regressing latent representations and employing a progressive upsampling approach, outperforming existing baselines on complex datasets.

Contribution

It presents what the authors describe as the first approach for free-form text-to-video generation on realistic datasets, using latent path construction and a stacking upPooling block for progressive frame synthesis.

Findings

01 Outperforms RNN and deconvolution-based methods
02 Effective in generating videos from complex natural language descriptions
03 Capable of handling realistic datasets like A2D and UCF101

Abstract

Video generation is one of the most challenging tasks in Machine Learning and Computer Vision. In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation. Humans can listen to or read natural language sentences and imagine or visualize what is being described; therefore, we believe that video generation from natural language sentences will have an important impact on Artificial Intelligence. Video generation is a relatively new field of study in Computer Vision and is far from being solved. The majority of recent works deal with synthetic datasets or real datasets with very limited types of objects, scenes, and emotions. To the best of our knowledge, this is the very first work on text-to-video generation (from free-form sentences) on more realistic video datasets such as the Actor and Action Dataset (A2D) or UCF101. We…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition