Generating Intermediate Representations for Compositional Text-To-Image Generation

Ran Galun; Sagie Benaim

arXiv:2410.09792·cs.CV·October 22, 2024

TL;DR

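This paper introduces a two-stage, compositional diffusion approach for text-to-image generation. The first stage generates intermediate representations (such as depth or segmentation maps) from the text, and the second stage maps them, together with the text, to the final image. Compared with a non-compositional baseline, the approach improves FID while keeping the CLIP score comparable.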

Contribution

It presents a novel two-stage diffusion-based method that generates intermediate representations to enhance spatial accuracy in text-to-image synthesis.

Findings

01 Improved FID score over baseline
02 Comparable CLIP score to baseline
03 Enhanced spatial detail in generated images

Abstract

Text-to-image diffusion models have demonstrated an impressive ability to produce high-quality outputs. However, they often struggle to accurately follow fine-grained spatial information in an input text. To this end, we propose a compositional approach for text-to-image generation based on two stages. In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations (such as depth or segmentation maps) conditioned on text. In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model. Our findings indicate that such a compositional approach can improve image generation, resulting in a notable improvement in FID score and a comparable CLIP score when compared to the standard non-compositional baseline.
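
The two-stage data flow described in the abstract can be sketched in a few lines of Python. This is only a toy illustration of the composition (text to intermediate maps, then maps plus text to image), not the authors' implementation. Every name here (`TwoStagePipeline`, `toy_stage1`, `toy_stage2`, `toy_denoise`) is hypothetical, and the "samplers" are simple stand-ins that start from noise and step toward a hand-written target instead of running a learned diffusion model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Grid = List[List[float]]  # tiny stand-in for an H x W map or a grayscale image


def toy_denoise(target: Grid, rng: random.Random, steps: int = 10) -> Grid:
    """Placeholder for a diffusion sampler (assumption): start from Gaussian
    noise and move halfway toward `target` at every step."""
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    for _ in range(steps):
        x = [[xi + 0.5 * (ti - xi) for xi, ti in zip(xr, tr)]
             for xr, tr in zip(x, target)]
    return x


def toy_stage1(prompt: str, rng: random.Random, size: int = 4) -> Dict[str, Grid]:
    """Stage 1 stand-in: text -> aligned intermediate maps.
    Hypothetical rule: the word "left" places the object in the left half,
    otherwise it goes in the right half."""
    left = "left" in prompt.lower()
    seg = [[1.0 if (c < size // 2) == left else 0.0 for c in range(size)]
           for _ in range(size)]
    # The object is near (small depth); the background is far (large depth).
    depth = [[0.2 if v else 0.8 for v in row] for row in seg]
    return {"segmentation": toy_denoise(seg, rng), "depth": toy_denoise(depth, rng)}


def toy_stage2(maps: Dict[str, Grid], prompt: str, rng: random.Random) -> Grid:
    """Stage 2 stand-in: (intermediate maps, text) -> final image.
    Pixels are bright where the mask is on and the surface is near; the
    text only sets an overall gain in this toy version."""
    gain = 1.0 if "bright" in prompt.lower() else 0.5
    target = [[gain * s * (1.0 - d) for s, d in zip(sr, dr)]
              for sr, dr in zip(maps["segmentation"], maps["depth"])]
    return toy_denoise(target, rng)


@dataclass
class TwoStagePipeline:
    stage1: Callable[[str, random.Random], Dict[str, Grid]]
    stage2: Callable[[Dict[str, Grid], str, random.Random], Grid]

    def generate(self, prompt: str, seed: int = 0) -> Grid:
        rng = random.Random(seed)
        maps = self.stage1(prompt, rng)        # text -> depth / segmentation
        return self.stage2(maps, prompt, rng)  # maps + text -> image


pipeline = TwoStagePipeline(toy_stage1, toy_stage2)
image = pipeline.generate("a cat on the left", seed=0)
```

In this sketch the spatial cue in the prompt reaches the final image only through the intermediate maps, which is the design idea behind the compositional approach: the second stage receives an explicit spatial layout and does not have to infer it from the text alone.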

Code & Models

Repositories

rang1991/public-intermediate-semantics-for-generation (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Human Motion and Animation · Multimedia Communication and Technology

Methods: Diffusion · Contrastive Language-Image Pre-training