LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Long Lian; Boyi Li; Adam Yala; Trevor Darrell

arXiv:2305.13655 · cs.CV · March 5, 2024 · 23 cites

TL;DR

This paper introduces a two-stage method that pairs a large language model with a diffusion model to improve prompt understanding and image generation accuracy. The method also handles prompts in multiple languages and multi-round scene specification.

Contribution

The paper presents a grounded generation approach that uses a pretrained LLM and a pretrained diffusion model without additional training, improving prompt comprehension and image accuracy over the base diffusion model.

Findings

01 Doubling accuracy across four tasks

02 Enabling multi-round scene specification

03 Handling prompts in multiple languages

Abstract

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in…
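The two-stage idea described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the LLM returns its layout as a JSON list of objects with a `caption` and a `box` given as `[x, y, width, height]` on a 512-pixel canvas; this format, the names `CaptionedBox`, `parse_layout`, and `box_to_mask`, and the coarse grid mask are all hypothetical and are not taken from the paper. The mask step only stands in for the kind of spatial signal a layout-grounded controller might consume; it is not the paper's controller.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionedBox:
    """One layout entry: a caption plus a pixel-space bounding box."""
    caption: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def parse_layout(llm_response: str, canvas: int = 512) -> list[CaptionedBox]:
    """Parse the assumed JSON layout format and clip each box to the canvas."""
    boxes = []
    for item in json.loads(llm_response):
        x, y, w, h = item["box"]
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(canvas, x + w), min(canvas, y + h)
        if x1 <= x0 or y1 <= y0:
            continue  # skip boxes that fall entirely outside the canvas
        boxes.append(CaptionedBox(item["caption"], x0, y0, x1 - x0, y1 - y0))
    return boxes


def box_to_mask(box: CaptionedBox, canvas: int = 512, grid: int = 8) -> list[list[int]]:
    """Rasterize a box onto a grid x grid mask: 1 where a cell center lies inside the box."""
    cell = canvas / grid
    mask = []
    for row in range(grid):
        cy = (row + 0.5) * cell
        mask.append([
            1 if (box.x <= (col + 0.5) * cell < box.x + box.width
                  and box.y <= cy < box.y + box.height) else 0
            for col in range(grid)
        ])
    return mask


# Stage 1 stand-in: a hand-written response in the assumed format.
response = (
    '[{"caption": "a gray cat", "box": [64, 256, 192, 192]},'
    ' {"caption": "a red ball", "box": [320, 320, 128, 128]}]'
)
layout = parse_layout(response)

# Stage 2 stand-in: one coarse mask per captioned box.
masks = {box.caption: box_to_mask(box) for box in layout}
```

Keeping the layout as plain captioned boxes means the first stage needs nothing from the LLM beyond text output, which fits the abstract's statement that both stages use existing pretrained models without further parameter optimization.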

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques

Methods: Diffusion · Balanced Selection