VSC: Visual Search Compositional Text-to-Image Diffusion Model
Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh

TL;DR
This paper introduces VSC, a compositional method for text-to-image diffusion that improves attribute-object binding in complex prompts. It decomposes prompts, generates sub-images, and refines representations, and it is reported to outperform existing models.
Contribution
The paper proposes a new compositional generation approach using pairwise image embeddings and segmentation-based training to enhance attribute-object binding in diffusion models.
Findings
Outperforms existing models on the T2I-CompBench benchmark.
Receives higher image-quality ratings from human evaluators.
Remains robust as prompt complexity increases.
Abstract
Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into…
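The prompt-decomposition step mentioned in the summary can be illustrated with a toy sketch. This is not the authors' implementation: the function names `decompose_prompt` and `to_sub_prompts` are hypothetical, and the rule "last word is the object, earlier words are attributes" is a simplifying assumption that only handles short noun phrases joined by "and" or commas.

```python
import re

# Words dropped before pairing attributes with objects (assumption: English articles only).
ARTICLES = {"a", "an", "the"}


def decompose_prompt(prompt):
    """Split a prompt into (attribute, object) pairs with a naive heuristic."""
    # Split on commas and the standalone word "and" to get one noun phrase per chunk.
    chunks = re.split(r",|\band\b", prompt.lower())
    pairs = []
    for chunk in chunks:
        words = [w for w in re.findall(r"[a-z]+", chunk) if w not in ARTICLES]
        if not words:
            continue
        # Assumption: the last word is the object; preceding words are its attributes.
        obj = words[-1]
        attribute = " ".join(words[:-1])
        pairs.append((attribute, obj))
    return pairs


def to_sub_prompts(pairs):
    """Turn each (attribute, object) pair back into a short sub-prompt string."""
    return [f"{attribute} {obj}".strip() for attribute, obj in pairs]


if __name__ == "__main__":
    pairs = decompose_prompt("a red apple and a blue car")
    print(pairs)
    print(to_sub_prompts(pairs))
```

In a pipeline of the kind the abstract describes, each sub-prompt would presumably be rendered as its own sub-image before the representations are combined; that part depends on details not given here, so the sketch stops at the text level.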
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Text Analysis Techniques
Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training
