VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas

TL;DR
VideoGPT combines a VQ-VAE with a GPT-like transformer in a simple, scalable architecture for natural video generation. Its samples are competitive with state-of-the-art GAN models on the BAIR Robot dataset, and it generates high-fidelity videos on UCF-101 and TGIF.
Contribution
The paper presents a minimalistic, likelihood-based transformer architecture for video generation that is easy to train and produces high-quality videos.
Findings
Generates video samples competitive with state-of-the-art GAN models on the BAIR Robot dataset.
Produces high-fidelity videos on UCF-101 and TGIF datasets.
Offers a reproducible, simple implementation for transformer-based video generation.
Abstract
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at…
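The discrete-latent step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `quantize`, the tensor shapes, the codebook size, and the raster-order flattening are all assumptions chosen for illustration, and the 3D-convolutional encoder with axial self-attention that would produce the latents is omitted. The sketch only shows the generic VQ-VAE idea of replacing each continuous latent vector with the index of its nearest codebook entry, which yields the token sequence an autoregressive prior would then model.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour vector quantization (generic VQ-VAE step).

    latents:  (T, H, W, D) continuous encoder outputs (hypothetical shape)
    codebook: (K, D) embedding vectors
    returns:  (T, H, W) integer codes and (T, H, W, D) quantized vectors
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared Euclidean distance from every latent to every codebook entry.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    codes = d2.argmin(axis=1)  # index of the closest entry per latent
    quantized = codebook[codes].reshape(latents.shape)
    return codes.reshape(latents.shape[:-1]), quantized

# Illustrative sizes only; not taken from the paper.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))        # K=8 codes of dimension D=4
latents = rng.normal(size=(2, 3, 3, 4))   # T=2, H=3, W=3 latent grid

codes, quantized = quantize(latents, codebook)
# Flatten the grid to a 1D token sequence (raster order is an assumption)
# that a GPT-like prior could model one token at a time.
tokens = codes.reshape(-1)
```

In a full model, the position of each token in the (T, H, W) grid would also be supplied to the prior, which is the role the abstract assigns to spatio-temporal position encodings.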
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: VQ-VAE
