Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever

TL;DR
This paper introduces a simple transformer-based method for zero-shot text-to-image generation that models text and image tokens jointly as a single stream. Given sufficient data and scale, it achieves competitive results without complex architectures, auxiliary losses, or side information.
Contribution
The authors propose a straightforward autoregressive transformer approach that models text and image tokens together for zero-shot generation, simplifying previous methods.
Findings
Competitive zero-shot performance on text-to-image tasks
Achieves results comparable to domain-specific models
Simplifies the modeling approach to a single-stream transformer
Abstract
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
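The "single stream" idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes, hypothetically, that text and image have already been converted to integer token ids, and that image ids are shifted by the text vocabulary size so the two vocabularies do not collide. The function names `build_stream` and `next_token_pairs` and the toy vocabulary size are illustrative only.

```python
from typing import List, Tuple


def build_stream(text_tokens: List[int],
                 image_tokens: List[int],
                 text_vocab_size: int) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Assumption (not from the source): image ids are offset by
    text_vocab_size so both modalities share one id space.
    """
    if any(not 0 <= t < text_vocab_size for t in text_tokens):
        raise ValueError("text token id out of range")
    if any(t < 0 for t in image_tokens):
        raise ValueError("image token id must be non-negative")
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Return (context, target) pairs for autoregressive training.

    Each token after the first is predicted from all tokens before it,
    regardless of whether it is a text or an image token.
    """
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


# Toy example with a hypothetical text vocabulary of 8 ids.
text = [3, 1, 4]
image = [0, 2]
stream = build_stream(text, image, text_vocab_size=8)
pairs = next_token_pairs(stream)

for context, target in pairs:
    print(context, "->", target)
```

Because the text tokens come first, every image token is predicted with the full caption in its context. In a sketch like this, sampling the image positions one token at a time after a given text prefix is what conditions the generated image on the text.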
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
Methods: Adam · 1-bit Adam
