Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer; Adam Polyak; Thomas Hayes; Xi Yin; Jie An; Songyang Zhang; Qiyuan Hu; Harry Yang; Oron Ashual; Oran Gafni; Devi Parikh; Sonal Gupta; Yaniv Taigman

arXiv:2209.14792·cs.CV·September 30, 2022·313 cites

TL;DR

Make-A-Video leverages advances in text-to-image models to generate diverse, high-quality videos from text without requiring paired text-video data, by decomposing temporal and spatial features and employing a novel spatial-temporal pipeline.

Contribution

It introduces a new method for text-to-video generation that does not need paired text-video data and builds on existing text-to-image models with innovative spatial-temporal modules.

Findings

01 Sets new state-of-the-art in text-to-video generation

02 Produces high-resolution, high-frame-rate videos

03 Achieves diverse and faithful video synthesis

Abstract

We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second,…
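The first design step in the abstract, approximating a full space-time operation with separate spatial and temporal parts, can be illustrated with a small factorized convolution. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it handles a single-channel video, and the function names (`spatial_conv2d`, `temporal_conv1d`, `factorized_conv`, `identity_temporal_kernel`) are hypothetical. The identity initialization of the temporal kernel is likewise an illustrative assumption, chosen to show how a temporal layer can be added on top of a spatial layer without changing its per-frame output at the start of training.

```python
import numpy as np


def spatial_conv2d(video, kernel):
    """Apply one 2D kernel to every frame independently.

    video:  array of shape (T, H, W), single channel (simplifying assumption).
    kernel: array of shape (k, k) with k odd.
    Uses zero padding so the output keeps the input shape. This is
    cross-correlation (no kernel flip), as in most deep learning libraries.
    """
    k = kernel.shape[0]
    pad = k // 2
    _, h, w = video.shape
    padded = np.pad(video, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(video, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[:, i:i + h, j:j + w]
    return out


def temporal_conv1d(video, kernel):
    """Apply one 1D kernel along the time axis at every pixel location.

    video:  array of shape (T, H, W).
    kernel: array of shape (k,) with k odd.
    """
    k = kernel.shape[0]
    pad = k // 2
    t = video.shape[0]
    padded = np.pad(video, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(video, dtype=float)
    for i in range(k):
        out += kernel[i] * padded[i:i + t]
    return out


def factorized_conv(video, spatial_kernel, temporal_kernel):
    """Space-then-time approximation of a full 3D convolution."""
    return temporal_conv1d(spatial_conv2d(video, spatial_kernel), temporal_kernel)


def identity_temporal_kernel(k):
    """Temporal kernel that passes each frame through unchanged (assumed init)."""
    kernel = np.zeros(k)
    kernel[k // 2] = 1.0
    return kernel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 5, 5))        # 4 frames of 5x5 pixels
    spatial_kernel = rng.normal(size=(3, 3))  # stands in for pretrained image weights

    per_frame = spatial_conv2d(video, spatial_kernel)
    with_time = factorized_conv(video, spatial_kernel, identity_temporal_kernel(3))

    # With an identity temporal kernel, the factorized layer matches the
    # per-frame spatial layer.
    print(np.allclose(per_frame, with_time))
```

In this sketch a 3x3 spatial kernel plus a 3-tap temporal kernel uses 9 + 3 = 12 weights, compared with 27 for a full 3x3x3 kernel, and the spatial weights can be reused from an image model while only the temporal weights are learned from video.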

Code & Models

Models

TempoFunk/makeavid-sd-jax (model · 27 downloads · 10 likes)

Videos

Movie Diffusion explained | Make-a-Video from MetaAI and Imagen Video from Google Brain· youtube

Make-A-Video: Text-To-Video Generation Without Text-Video Data | Paper Explained· youtube

Make-A-Video: Text-to-Video Generation without Text-Video Data· slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Max Pooling · Concatenated Skip Connection · Convolution · U-Net