Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

TL;DR
This paper introduces MAGVIT-v2, a video tokenizer that lets large language models outperform diffusion models on image and video generation benchmarks. The tokenizer also surpasses the previously top-performing video tokenizer on video compression and action recognition.
Contribution
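The paper presents MAGVIT-v2, a video tokenizer that produces concise and expressive tokens for both videos and images from a common token vocabulary. Equipped with it, LLMs surpass diffusion models on standard generation benchmarks, and the tokenizer itself surpasses the previously top-performing video tokenizer.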
Findings
Equipped with MAGVIT-v2, LLMs outperform diffusion models on standard image and video generation benchmarks, including ImageNet and Kinetics.
The tokenizer achieves video compression comparable to the next-generation video codec (VCC).
Effective representations for action recognition are learned using the new tokenizer.
Abstract
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human…
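To make the tokenizer's role concrete, here is a minimal sketch of mapping pixel-space input to discrete tokens. It uses generic nearest-neighbor vector quantization over image patches, which is an assumption chosen for illustration and not MAGVIT-v2's actual quantizer. The names `patchify`, `nearest_code`, `tokenize`, `TOY_CODEBOOK`, and `TOY_IMAGE` are hypothetical, and the codebook values are hand-written rather than learned.

```python
from typing import List, Sequence


def patchify(image: List[List[float]], size: int) -> List[List[float]]:
    """Split a 2-D grayscale image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % size or width % size:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            # Flatten each patch row by row.
            patches.append([image[top + dy][left + dx]
                            for dy in range(size) for dx in range(size)])
    return patches


def nearest_code(patch: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the patch."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((p - c) ** 2 for p, c in zip(patch, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def tokenize(image: List[List[float]],
             codebook: Sequence[Sequence[float]],
             size: int = 2) -> List[int]:
    """Map an image to a sequence of discrete token ids, in raster order."""
    return [nearest_code(p, codebook) for p in patchify(image, size)]


# Hypothetical hand-written codebook (a real tokenizer would learn this).
TOY_CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],  # token 0: dark patch
    [1.0, 1.0, 1.0, 1.0],  # token 1: bright patch
    [0.0, 1.0, 0.0, 1.0],  # token 2: dark left column, bright right column
]

# Hypothetical 4x4 grayscale image with values in [0, 1].
TOY_IMAGE = [
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.9, 0.1, 1.0],
    [0.1, 1.0, 0.0, 0.9],
]

tokens = tokenize(TOY_IMAGE, TOY_CODEBOOK)
print(tokens)
```

For the toy input above, the four 2x2 patches map to the token sequence `[0, 1, 2, 2]`. A sequence of integer ids like this is the kind of input a language model can be trained on, which is why the abstract treats the tokenizer as the crucial component.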
Code & Models
- 🤗 TencentARC/Open-MAGVIT2 (model) · ♡ 14
- 🤗 Ziyaad30/Pyramid-Flow-sd3 (model) · ♡ 2
- 🤗 CofeAI/O2-MAGVIT2-preview (model) · ♡ 1
- 🤗 TencentARC/Open-MAGVIT2-Tokenizer-128-resolution (model) · 9 dl · ♡ 1
- 🤗 TencentARC/Open-MAGVIT2-Tokenizer-256-resolution (model) · 8 dl · ♡ 1
- 🤗 TencentARC/Open-MAGVIT2-AR-B-256-resolution (model)
- 🤗 TencentARC/Open-MAGVIT2-AR-L-256-resolution (model) · 1 dl
- 🤗 TencentARC/Open-MAGVIT2-AR-XL-256-resolution (model) · ♡ 1
- 🤗 GrayShine/WeTok (model) · ♡ 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education
Methods: Diffusion
