Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun; Yi Jiang; Shoufa Chen; Shilong Zhang; Bingyue Peng; Ping Luo; Zehuan Yuan

arXiv:2406.06525·cs.CV·June 11, 2024·5 cites

TL;DR

This paper demonstrates that vanilla autoregressive models such as Llama, when scaled properly, can achieve state-of-the-art image generation performance, challenging the dominance of diffusion models.

Contribution

It introduces LlamaGen, a family of autoregressive image generation models that outperform popular diffusion models such as LDM and DiT on ImageNet 256x256, based on a re-examination of image tokenizer design, model scalability, and training data quality.

Findings

01 Achieved 2.18 FID on ImageNet 256x256 benchmarks.

02 Developed an image tokenizer with 97% codebook usage.

03 Realized 326%-414% inference speedup using LLM serving frameworks.

Abstract

We introduce LlamaGen, a new family of image generation models that apply original "next-token prediction" paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT.…
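
To make the abstract's numbers concrete, here is a minimal sketch of how a tokenizer with a downsample ratio of 16 determines the sequence length, followed by a toy class-conditional next-token sampling loop. It assumes the tokenizer is applied directly to a square 256x256 input. Everything named here (`token_grid`, `sample_tokens`, `uniform_stub`, the codebook size of 8, the class label 207) is hypothetical and chosen for illustration; it is not LlamaGen's code or API, and the stub stands in for a trained Transformer.

```python
import random


def token_grid(image_size: int, downsample_ratio: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image."""
    if image_size % downsample_ratio != 0:
        raise ValueError("image_size must be divisible by downsample_ratio")
    side = image_size // downsample_ratio
    return side, side * side


def sample_tokens(next_token_weights, num_tokens, class_label, seed=0):
    """Generic next-token sampling: each token is drawn conditioned on
    the class label and on all previously generated tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        # Hypothetical model call: returns one non-negative weight
        # per codebook entry for the next position.
        weights = next_token_weights(class_label, tokens)
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        tokens.append(choice)
    return tokens


def uniform_stub(class_label, prefix, codebook_size=8):
    """Placeholder for a trained model: ignores its inputs and
    returns uniform weights over a tiny, made-up codebook."""
    return [1.0] * codebook_size


side, num_tokens = token_grid(image_size=256, downsample_ratio=16)
tokens = sample_tokens(uniform_stub, num_tokens, class_label=207)
print(side, num_tokens, len(tokens))
```

Under these assumptions a 256x256 image maps to a 16x16 grid, so the model predicts 256 discrete tokens one at a time before a decoder (not shown) would turn them back into pixels.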

Code & Models

Models

🤗 FoundationVision/LlamaGen · model · ♡ 29

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion