MAGVIT: Masked Generative Video Transformer

Lijun Yu; Yong Cheng; Kihyuk Sohn; José Lezama; Han Zhang; Huiwen Chang; Alexander G. Hauptmann; Ming-Hsuan Yang; Yuan Hao; Irfan Essa; Lu Jiang

arXiv:2212.05199·cs.CV·April 6, 2023

TL;DR

MAGVIT is a versatile video transformer model that excels in multiple video synthesis tasks, offering high quality, efficiency, and cross-domain generalization, outperforming existing methods on several benchmarks.

Contribution

Introduces MAGVIT, a unified model with a 3D tokenizer and masked token modeling for multi-task video synthesis, achieving state-of-the-art results and fast inference.

Findings

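01 Establishes the best-published FVD on three video generation benchmarks, including Kinetics-600.

02 Runs inference about two orders of magnitude faster than diffusion models and 60x faster than autoregressive models.

03 Supports ten diverse video generation tasks with a single model and generalizes across visual domains.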

Abstract

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The…
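The two ideas named in the abstract, quantizing a video into a grid of spatial-temporal tokens and then masking some of those tokens for a model to predict, can be illustrated with a toy sketch. This is not the MAGVIT implementation: the real tokenizer is a learned 3D network, and the paper's embedding method for masked tokens is not described on this page. The grid size, codebook, mask ratio, and the helper names `quantize` and `mask_tokens` are all hypothetical, chosen only to keep the example small.

```python
import random


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in latents
    ]


def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_id
    return masked, positions


# Toy setup; every value here is an assumption for illustration.
T, H, W = 2, 2, 2                      # token grid: time x height x width
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
MASK_ID = len(codebook)                # one extra id reserved for [MASK]

rng = random.Random(0)
# Stand-in for encoder outputs: one 2-D latent per grid cell, flattened.
latents = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(T * H * W)]

tokens = quantize(latents, codebook)   # discrete spatial-temporal tokens
masked, positions = mask_tokens(tokens, 0.5, MASK_ID, rng)

print("tokens:", tokens)
print("masked:", masked)
print("masked positions:", positions)
```

A masked-token model would be trained to recover `tokens` at `positions` given `masked`; different generation tasks then correspond to different choices of which positions are known in advance.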

Code & Models

Repositories

google-research/magvit (JAX, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Dropout · Byte Pair Encoding · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Residual Connection