Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu; Yuanzhong Xu; Jing Yu Koh; Thang Luong; Gunjan Baid; Zirui Wang; Vijay Vasudevan; Alexander Ku; Yinfei Yang; Burcu Karagol Ayan; Ben Hutchinson; Wei Han; Zarana Parekh; Xin Li; Han Zhang; Jason Baldridge; Yonghui Wu

arXiv:2206.10789 · cs.CV · June 23, 2022 · 340 cites

TL;DR

The paper introduces Parti, a large-scale autoregressive model for content-rich text-to-image generation, leveraging sequence modeling and a Transformer-based image tokenizer to produce high-quality, photorealistic images from complex prompts.

Contribution

It presents a novel sequence-to-sequence approach for text-to-image synthesis, scaling Transformer models up to 20B parameters and introducing a new benchmark for evaluation.

Findings

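01 Achieved a state-of-the-art zero-shot FID score of 7.23 on MS-COCO

02 Attained a finetuned FID score of 3.22 on MS-COCO

03 Demonstrated effectiveness across a wide variety of categories and difficulty aspects, analyzed on Localized Narratives and on PartiPrompts (P2), a new benchmark of over 1600 English prompts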
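
For readers unfamiliar with the metric behind the scores above: FID (Fréchet Inception Distance) compares Gaussian statistics of Inception-network features computed on real and on generated images, and lower values are better. The standard definition is given below as a reference; it is not taken from this page, and the symbols are notation assumed here: \mu_r, \Sigma_r are the feature mean and covariance for real images, and \mu_g, \Sigma_g the same for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```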

Abstract

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new…
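
The two steps named in the abstract can be illustrated with a toy sketch. This is not the Parti or ViT-VQGAN implementation: the names `quantize`, `generate_image_tokens`, and `dummy_scores` are hypothetical, the codebook and patch vectors are made-up numbers, and greedy decoding is a simplification chosen so the example is deterministic. It only shows the shape of the idea: images become sequences of discrete token ids via a nearest-codebook lookup, and image tokens are then produced one at a time, conditioned on the text tokens and on the image tokens generated so far.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patches: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each patch embedding to the index of its nearest codebook entry.

    This stands in for the 'image tokenizer' step: an image becomes a
    sequence of discrete token ids.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(p, codebook[k]))
        for p in patches
    ]


def generate_image_tokens(
    text_tokens: List[int],
    next_token_scores: Callable[[List[int], List[int]], List[float]],
    num_image_tokens: int,
) -> List[int]:
    """Greedy autoregressive decoding of image tokens.

    Each new token depends on the text tokens and on all image tokens
    generated before it, as in sequence-to-sequence translation.
    """
    image_tokens: List[int] = []
    for _ in range(num_image_tokens):
        scores = next_token_scores(text_tokens, image_tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        image_tokens.append(best)
    return image_tokens


def dummy_scores(text_tokens: List[int], image_tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for a trained encoder-decoder Transformer."""
    vocab_size = 4
    target = (sum(text_tokens) + len(image_tokens)) % vocab_size
    return [1.0 if k == target else 0.0 for k in range(vocab_size)]


# Made-up 2-D codebook and patch embeddings, for illustration only.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

tokens = quantize(patches, codebook)
generated = generate_image_tokens([3, 1], dummy_scores, num_image_tokens=4)
print(tokens)
print(generated)
```

In a real system the generated token ids would be passed back through the tokenizer's decoder to reconstruct pixels; that step is omitted here.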

Code & Models

Models

Deci/DeciDiffusion-v1-0 · model · 23 downloads · 140 likes

Videos

Parti - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Paper Explained) · YouTube

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Residual Connection