VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan; Yunzhi Zhang; Pieter Abbeel; Aravind Srinivas

arXiv:2104.10157 · cs.CV · September 16, 2021 · 144 cites

TL;DR

VideoGPT combines a VQ-VAE with a GPT-like transformer in a simple, scalable architecture for natural video generation. Its samples are competitive with state-of-the-art GAN models on the BAIR Robot dataset, and it generates high-fidelity videos on UCF-101 and TGIF.

Contribution

The paper presents a minimalistic, likelihood-based transformer architecture for video generation that is easy to train and generates high-quality videos.

Findings

01. Generates competitive video samples on the BAIR Robot dataset.

02. Produces high-fidelity videos on the UCF-101 and TGIF datasets.

03. Offers a simple, reproducible implementation of transformer-based video generation.

Abstract

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity of its formulation and its ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at…
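The two stages named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation, and it replaces the learned 3D-convolutional encoder with hand-written latent vectors. It shows only the generic VQ-VAE step of snapping each latent vector to its nearest codebook entry, followed by flattening the resulting T x H x W grid of indices into the token sequence that an autoregressive prior would model as p(z) = ∏ p(z_i | z_<i). The function names (`nearest_code`, `quantize`, `flatten_raster`, `positions`), the raster ordering, and the toy codebook are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook vector closest to z (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def quantize(latents, codebook) -> List[List[List[int]]]:
    """Map a T x H x W grid of latent vectors to a grid of discrete code indices."""
    return [[[nearest_code(z, codebook) for z in row] for row in frame]
            for frame in latents]


def flatten_raster(codes) -> List[int]:
    """Flatten the index grid into one sequence: time first, then rows, then columns.

    The ordering is an assumption of this sketch.
    """
    return [c for frame in codes for row in frame for c in row]


def positions(t: int, h: int, w: int) -> List[Tuple[int, int, int]]:
    """(time, row, column) coordinate of each token, in the same raster order.

    A spatio-temporal position encoding could be looked up from these coordinates.
    """
    return [(i, j, k) for i in range(t) for j in range(h) for k in range(w)]


# Hypothetical toy data: 3 codebook entries, a 2 x 1 x 2 grid of 2-D latents.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [
    [[(0.1, -0.1), (0.9, 1.2)]],   # frame 0
    [[(-0.8, 0.7), (0.4, 0.3)]],   # frame 1
]

codes = quantize(latents, codebook)
tokens = flatten_raster(codes)
print(tokens)              # [0, 1, 2, 0]
print(positions(2, 1, 2))  # [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)]
```

In a full model, a GPT-like network would be trained to predict each entry of `tokens` from the preceding ones, and sampled token sequences would be reshaped back to the grid and passed through the VQ-VAE decoder to obtain video frames.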

Code & Models

Models

🤗 hpcai-tech/vqvae · model · 8 dl · ♡ 6

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: VQ-VAE