Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia; William Chan; Saurabh Saxena; Lala Li; Jay Whang; Emily Denton; Seyed Kamyar Seyed Ghasemipour; Burcu Karagol Ayan; S. Sara Mahdavi; Rapha Gontijo Lopes; Tim Salimans; Jonathan Ho; David J Fleet; Mohammad Norouzi

arXiv:2205.11487 · cs.CV · May 24, 2022 · 2.1k citations

Open Access · 5 Repos · 10 Models · 1 Dataset · 2 Videos

TL;DR

Imagen is a text-to-image diffusion model that pairs a large pretrained language model for text understanding with diffusion-based image generation. It produces photorealistic samples with strong image-text alignment and reaches a state-of-the-art FID of 7.27 on COCO without training on COCO.

Contribution

The paper introduces Imagen, a diffusion model that uses a generic large language model pretrained on text-only corpora (e.g. T5) to encode prompts. It shows that scaling the language model improves sample fidelity and image-text alignment much more than scaling the image diffusion model.

Findings

01

Achieves a new SOTA FID score of 7.27 on COCO without training on it.

02

Human raters find Imagen samples on par with real COCO images in image-text alignment.

03

Human raters prefer Imagen over recent models such as DALL-E 2 in side-by-side comparisons of sample quality and image-text alignment.

Abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater…
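
The FID figure quoted above is a Fréchet distance between Gaussian fits to feature statistics of generated and reference images. As a rough illustration only (not the paper's evaluation code), the sketch below computes that distance with NumPy. The function name `frechet_distance` and the random arrays standing in for image features are assumptions made for this example; a real FID evaluation would use Inception network activations of actual images.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim). This is a
    hypothetical helper for illustration, not a library function.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite inputs
    # these are real and non-negative; clip small negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Placeholder "features": random vectors, NOT real image activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])

print(frechet_distance(real, real))     # close to 0 for identical sets
print(frechet_distance(real, shifted))  # close to 1, the squared mean shift
```

Lower values mean the two feature distributions are closer, which is why a lower FID is read as better sample fidelity. Shifting every sample by the same vector leaves the covariance unchanged, so only the mean term contributes in the second call.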

Code & Models

Datasets

shunk031/DrawBench
dataset · 114 downloads

Videos

Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠 | Diffusion models illustrated · YouTube

What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more] · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Computational and Text Analysis Methods

Methods: Diffusion