Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi

TL;DR
Imagen is a text-to-image diffusion model that pairs a large pretrained language model for text understanding with diffusion-based, high-fidelity image generation. It reports state-of-the-art photorealism and image-text alignment, including on COCO, a dataset it was never trained on.
Contribution
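The paper introduces Imagen, a text-to-image diffusion model that uses a generic large language model pretrained on text-only corpora (e.g. T5) to encode the prompt. It finds that increasing the size of the language model improves sample fidelity and image-text alignment much more than increasing the size of the image diffusion model, and it sets a new state of the art in photorealism and alignment.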
Findings
Achieves a new SOTA FID score of 7.27 on COCO without training on it.
Human raters find Imagen samples on par with the COCO data itself in image-text alignment.
Outperforms recent models like DALL-E 2 in quality and alignment.
Abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater…
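The FID score quoted above is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The following is a minimal sketch of that formula only, not the paper's evaluation pipeline. It assumes features are already extracted (FID is conventionally computed on Inception network activations), and it uses random vectors as stand-ins; the function name `frechet_distance` and all array sizes are illustrative choices, not taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, dim). In a real FID computation the
    rows would be image features; here they are arbitrary vectors.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices. Clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in "features": two draws from the same distribution, one shifted.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
same = rng.normal(size=(2000, 8))
shifted = rng.normal(loc=0.5, size=(2000, 8))

fid_same = frechet_distance(real, same)        # close to zero
fid_shifted = frechet_distance(real, shifted)  # clearly larger
```

Lower is better: two samples from the same distribution score near zero, while a shift in the mean or covariance raises the score. Absolute values depend on the feature extractor and sample count, so numbers from this toy setup are not comparable to the 7.27 reported on COCO.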
Code & Models
- 🤗 stable-diffusion-v1-5/stable-diffusion-v1-5 (model) · 1.7M dl · ♡ 1066
- 🤗 CompVis/stable-diffusion-v1-4 (model) · 468k dl · ♡ 6991
- 🤗 CompVis/stable-diffusion-v-1-4-original (model) · ♡ 2843
- 🤗 jm12138/riffusion-model-v1 (model) · ♡ 3
- 🤗 CompVis/stable-diffusion-v-1-1-original (model) · ♡ 19
- 🤗 CompVis/stable-diffusion-v-1-2-original (model) · ♡ 14
- 🤗 CompVis/stable-diffusion-v-1-3-original (model) · 24 dl · ♡ 19
- 🤗 CompVis/stable-diffusion-v1-3 (model) · 50 dl · ♡ 39
- 🤗 CompVis/stable-diffusion-v1-1 (model) · 1.5k dl · ♡ 81
- 🤗 CompVis/stable-diffusion-v1-2 (model) · 61 dl · ♡ 40
Videos
Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠 | Diffusion models illustrated · YouTube
What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more] · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Computational and Text Analysis Methods
Methods: Diffusion
