Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas; Mohammad Babaeizadeh; Pieter-Jan Kindermans; Hernan Moraldo; Han Zhang; Mohammad Taghi Saffar; Santiago Castro; Julius Kunze; Dumitru Erhan

arXiv:2210.02399·cs.CV·October 6, 2022·79 cites




TL;DR

Phenaki is a novel model that enables realistic, variable-length video generation from open domain text prompts by using a discrete token-based representation and joint training on image and video datasets.

Contribution

It introduces a new video representation learning method with a causal attention tokenizer and a bidirectional transformer conditioned on text, allowing for flexible, long video generation from prompts.
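A bidirectional transformer over discrete tokens is commonly trained by hiding a subset of tokens and predicting them from the rest. The sketch below shows only that corruption step on a toy token list. It is an illustration under assumptions, not the paper's procedure: `MASK_ID`, `mask_tokens`, the mask ratio, and the token values are all hypothetical, and text conditioning is omitted.

```python
import random

MASK_ID = -1  # hypothetical placeholder id, assumed not to collide with real token ids


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions.
    A model would be trained to predict the original tokens at those positions.
    """
    n = len(tokens)
    num_masked = max(1, round(mask_ratio * n))
    masked_positions = set(rng.sample(range(n), num_masked))
    corrupted = [
        MASK_ID if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    return corrupted, sorted(masked_positions)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
video_tokens = [17, 4, 92, 8, 33, 61, 5, 70]  # made-up token ids
corrupted, positions = mask_tokens(video_tokens, 0.5, rng)
```

Because every unmasked token stays visible, a predictor can use context on both sides of each masked position, which is what "bidirectional" refers to here.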

Findings

01

Generates arbitrarily long videos conditioned on prompt sequences

02

Achieves better spatio-temporal consistency than per-frame baselines

03

Generalizes beyond available video datasets through joint training

Abstract

We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in…
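The link between causal attention in time and variable-length input can be illustrated with a frame-level attention mask. This is a minimal sketch rather than the paper's tokenizer: the function names are hypothetical, and it works on frame indices only, ignoring the spatial patches within each frame.

```python
def causal_time_mask(num_frames):
    """Build a mask where mask[t][s] is True when frame t may attend to frame s.

    Causal in time means a frame sees itself and earlier frames only (s <= t).
    """
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def visible_frames(mask, t):
    """List the frame indices that frame t is allowed to attend to."""
    return [s for s, allowed in enumerate(mask[t]) if allowed]


short_mask = causal_time_mask(3)
long_mask = causal_time_mask(5)

# The mask for a 3-frame clip equals the top-left corner of the 5-frame mask,
# so the first frames are treated identically whatever the clip length is.
prefix_of_long = [row[:3] for row in long_mask[:3]]
```

Since no frame depends on later frames, extending a clip does not change what earlier frames attend to, which is the property that lets one model handle clips of different lengths.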


Code & Models

Models

🤗 obvious-research/phenaki-cvivit · model · ♡ 6

Videos

Generate long form video with Transformers | Phenaki from Google Brain explained · YouTube

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Motion and Animation