Compositional Text-to-Image Generation with Dense Blob Representations
Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

TL;DR
This paper introduces BlobGEN, a novel compositional text-to-image generation model that uses dense blob representations for fine-grained, controllable scene synthesis, leveraging large language models for improved prompt understanding.
Contribution
The paper proposes dense blob representations for scene decomposition, a blob-grounded diffusion model, and an in-context learning approach with LLMs for enhanced compositional generation.
Findings
Achieves superior zero-shot generation quality on MS-COCO
Demonstrates better layout-guided controllability
Exhibits improved numerical and spatial correctness on benchmarks
Abstract
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality…
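The abstract names a masked cross-attention module but does not spell out its mechanics. One plausible reading, offered here as an assumption rather than the authors' implementation, is that each image location attends only to the blobs whose region covers it, so that one blob's features do not leak into another's area. The sketch below illustrates that idea in dependency-free Python; the function name `masked_cross_attention`, the argument layout, and the choice to return zeros for uncovered locations are all hypothetical.

```python
import math


def masked_cross_attention(queries, keys, values, mask):
    """Illustrative masked cross-attention (an assumption, not BlobGEN's code).

    queries: N x d list of image-location features
    keys:    M x d list of blob embeddings
    values:  M x dv list of blob embeddings
    mask:    N x M booleans; mask[i][j] is True when blob j covers location i
    Returns an N x dv list. Locations covered by no blob get a zero vector.
    """
    d = len(queries[0])
    dv = len(values[0])
    out = []
    for q, allowed in zip(queries, mask):
        idx = [j for j, ok in enumerate(allowed) if ok]
        if not idx:
            # No blob covers this location: contribute nothing.
            out.append([0.0] * dv)
            continue
        # Scaled dot-product scores, restricted to the allowed blobs.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in idx
        ]
        # Numerically stable softmax over the allowed blobs only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [
                sum(w * values[j][c] for w, j in zip(weights, idx)) / total
                for c in range(dv)
            ]
        )
    return out


# Toy example: three image locations, two blobs.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
mask = [
    [True, False],   # location 0 lies inside blob 0 only
    [False, True],   # location 1 lies inside blob 1 only
    [False, False],  # location 2 lies inside no blob
]
out = masked_cross_attention(queries, keys, values, mask)
mixed = masked_cross_attention([[1.0, 1.0]], keys, values, [[True, True]])
```

In this toy setup, a location covered by a single blob receives exactly that blob's value vector, and a location covered by several blobs receives a softmax-weighted mixture of them.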
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Softmax · Concatenated Skip Connection · Diffusion
