Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

TL;DR
The paper introduces Parti, a large-scale autoregressive model for content-rich text-to-image generation, leveraging sequence modeling and a Transformer-based image tokenizer to produce high-quality, photorealistic images from complex prompts.
Contribution
It presents a novel sequence-to-sequence approach for text-to-image synthesis, scaling Transformer models up to 20B parameters and introducing a new benchmark for evaluation.
Findings
Achieved a state-of-the-art zero-shot FID score of 7.23 on MS-COCO
Attained a finetuned FID score of 3.22 on MS-COCO
Demonstrated effectiveness across diverse categories and prompt complexities
Abstract
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new…
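The two steps described in the abstract (turn an image into discrete tokens, then treat text-to-image generation as sequence-to-sequence modeling) can be illustrated with a toy sketch. It is not the Parti or ViT-VQGAN implementation: the hand-written `codebook`, the 2-dimensional `patches`, the text token IDs, and the helper names `quantize` and `build_training_pair` are all hypothetical, and nearest-neighbour codebook lookup stands in for a learned tokenizer.

```python
# Toy illustration only: a hand-written codebook stands in for a learned one.
from typing import Dict, List, Sequence


def quantize(patches: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each patch embedding to the index of its nearest codebook vector."""
    tokens = []
    for patch in patches:
        # Squared Euclidean distance from this patch to every codebook entry.
        distances = [
            sum((p - c) ** 2 for p, c in zip(patch, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def build_training_pair(text_tokens: Sequence[int],
                        image_tokens: Sequence[int]) -> Dict[str, List[int]]:
    """Frame the example as translation: text is the source, image tokens the target."""
    return {"source": list(text_tokens), "target": list(image_tokens)}


# Hypothetical 4-entry codebook and four 2-D "patch embeddings".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.05], [0.9, 0.2], [0.2, 0.8], [0.95, 0.9]]

image_tokens = quantize(patches, codebook)
pair = build_training_pair([17, 4, 256], image_tokens)  # made-up text token IDs
print(pair)
```

In this framing, an encoder-decoder Transformer would be trained to predict `pair["target"]` one token at a time given `pair["source"]`, and a separate decoder would map the predicted image tokens back to pixels.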
Videos
Parti - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Paper Explained) · YouTube
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Residual Connection
