Improving Compositional Text-to-image Generation with Large Vision-Language Models

Song Wen; Guian Fang; Renrui Zhang; Peng Gao; Hao Dong; Dimitris Metaxas

arXiv:2310.06311·cs.CV·October 11, 2023·1 cites

Open Access

TL;DR

The paper combines large vision-language models (LVLMs) with diffusion models to improve compositional text-to-image generation, so that generated images align more closely with complex input descriptions.

Contribution

The paper proposes a multi-stage method: an LVLM assesses the alignment between generated images and their input texts to guide fine-tuning of the diffusion model, and at inference it pinpoints misaligned areas that an image editing algorithm then corrects.

Findings

01 Improved alignment with complex input texts

02 Enhanced object and attribute accuracy in generated images

03 Better spatial relationship representation

Abstract

Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Utilizing this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During the inference phase, an initial image is produced using the fine-tuned diffusion model. The LVLM is then employed to pinpoint areas of misalignment in the initial image, which are subsequently corrected using the image editing algorithm until no further…
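
The inference phase described in the abstract can be sketched as a generate, assess, and edit loop. This is a minimal illustration, not the authors' implementation: the `Misalignment` record, the `generate`, `assess`, and `edit` callables, and the `max_rounds` cap are all hypothetical names introduced here. The abstract's stopping condition is truncated, so the sketch assumes the loop ends once the assessor reports no remaining misalignments.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Misalignment:
    """Hypothetical record of one mismatch reported by the LVLM."""
    region: str       # where in the image the problem is (assumed field)
    instruction: str  # what the editing step should fix (assumed field)


def generate_and_correct(
    prompt: str,
    generate: Callable[[str], Any],                  # stand-in for the fine-tuned diffusion model
    assess: Callable[[Any, str], List[Misalignment]],  # stand-in for the LVLM
    edit: Callable[[Any, Misalignment], Any],        # stand-in for the image editing algorithm
    max_rounds: int = 3,                             # assumed safety cap, not from the paper
) -> Any:
    # Produce the initial image from the text prompt.
    image = generate(prompt)
    for _ in range(max_rounds):
        # Ask the assessor which parts of the image disagree with the prompt.
        issues = assess(image, prompt)
        if not issues:
            break  # assumed stopping rule: nothing left to correct
        # Apply one edit per reported misalignment.
        for issue in issues:
            image = edit(image, issue)
    return image


# Toy stand-ins so the sketch runs without any model: an "image" is just
# the set of phrases it depicts.
def toy_generate(prompt: str) -> frozenset:
    # Pretend the generator only renders the first phrase of the prompt.
    return frozenset([prompt.split(", ")[0]])


def toy_assess(image: frozenset, prompt: str) -> List[Misalignment]:
    # Report every phrase in the prompt that the image does not depict.
    return [
        Misalignment(region="anywhere", instruction=phrase)
        for phrase in prompt.split(", ")
        if phrase not in image
    ]


def toy_edit(image: frozenset, issue: Misalignment) -> frozenset:
    # "Fix" the image by adding the missing phrase.
    return image | {issue.instruction}


result = generate_and_correct(
    "red cube, blue sphere", toy_generate, toy_assess, toy_edit
)
```

Passing the three stages in as callables keeps the loop independent of any particular diffusion model, LVLM, or editing method; real components would replace the toy functions.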

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: ALIGN · Diffusion