Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Ahmad Süleyman, Göksel Biricik

TL;DR
ObjectDiffusion is a novel model that enhances text-to-image diffusion by integrating semantic and spatial grounding, enabling precise control over object placement and improving image quality and diversity.
Contribution
The paper introduces ObjectDiffusion, a new approach that conditions diffusion models on grounding information, combining ControlNet and GLIGEN techniques for improved controllable image synthesis.
Findings
Achieves state-of-the-art results on the COCO2017 dataset
Demonstrates strong grounding and control in diverse contexts
Produces high-fidelity images with precise object placement
Abstract
Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and…
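As a rough illustration of what "semantic and spatial grounding information" can look like as model input, the sketch below pairs each object label with a bounding box normalized to [0, 1] and expands the box coordinates into sinusoidal (Fourier) features, the kind of positional encoding GLIGEN applies to box coordinates before fusing them with text embeddings. The names `GroundedObject`, `normalize_box`, and `fourier_features`, the number of frequencies, and the frequency schedule are all assumptions made for illustration; this is not the ObjectDiffusion implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


@dataclass
class GroundedObject:
    """Hypothetical container: one object to render and where to place it."""
    label: str  # semantic grounding, e.g. a COCO category name
    box: Box    # spatial grounding, normalized to [0, 1]


def normalize_box(box_px: Box, width: int, height: int) -> Box:
    """Convert a pixel-space box to coordinates relative to the image size."""
    x0, y0, x1, y1 = box_px
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError(f"box {box_px} does not fit a {width}x{height} image")
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def fourier_features(box: Box, num_freqs: int = 4) -> List[float]:
    """Expand each coordinate into sin/cos pairs at several frequencies.

    The schedule (pi * 2**k) is an assumed choice for this sketch.
    """
    feats: List[float] = []
    for k in range(num_freqs):
        freq = math.pi * (2.0 ** k)
        for coord in box:
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


# Example: a "dog" placed in a 512x512 image.
obj = GroundedObject("dog", normalize_box((64, 128, 320, 448), 512, 512))
features = fourier_features(obj.box)
```

In a full model, a feature vector like `features` would typically be combined with a text embedding of `obj.label` and projected into a grounding token; that step depends on the specific encoder and is left out here.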
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
