VSC: Visual Search Compositional Text-to-Image Diffusion Model

Do Huu Dat; Nam Hyeonu; Po-Yuan Mao; Tae-Hyun Oh

arXiv:2505.01104·cs.CV·May 5, 2025

Open Access

TL;DR

This paper introduces VSC, a compositional method for text-to-image diffusion that improves attribute-object binding in complex prompts. It decomposes a prompt, generates sub-images, and refines the resulting representations, and it outperforms existing models on the T2I-CompBench benchmark.

Contribution

The paper proposes a new compositional generation approach using pairwise image embeddings and segmentation-based training to enhance attribute-object binding in diffusion models.
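As a rough illustration of the prompt-decomposition step named in the TL;DR, the sketch below splits a prompt into one sub-prompt per attribute-object phrase. It is a toy heuristic, not the paper's method: the function name `decompose_prompt` and the rule of splitting on commas and the word "and" are assumptions made for this example, and the summary does not say how VSC actually performs the decomposition.

```python
import re
from typing import List


def decompose_prompt(prompt: str) -> List[str]:
    """Split a prompt into sub-prompts, one per attribute-object phrase.

    Hypothetical helper: splits on commas and on the standalone word "and".
    This is an illustrative assumption, not the decomposition used by VSC.
    """
    cleaned = prompt.strip().rstrip(".")
    # Match ", " or ", and " first, then a bare " and " between phrases.
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", cleaned)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    for sub_prompt in decompose_prompt("a red apple, a blue book, and a green vase"):
        print(sub_prompt)
```

Each sub-prompt could then be rendered separately into a sub-image, which is the role the TL;DR assigns to the next stage; that stage is not sketched here because the summary gives no details about it.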

Findings

01 Outperforms existing models on the T2I-CompBench benchmark.

02 Achieves better image quality in human evaluations.

03 Remains robust as prompt complexity increases.

Abstract

Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Text Analysis Techniques

Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training