GroundingBooth: Grounding Text-to-Image Customization

Zhexiao Xiong; Wei Xiong; Jing Shi; He Zhang; Yizhi Song; Nathan Jacobs

arXiv:2409.08520·cs.CV·March 18, 2025

Open Access 3 Reviews

TL;DR

GroundingBooth is a novel method for zero-shot, instance-level spatial grounding in text-to-image customization, enabling precise layout control, identity preservation, and multi-subject personalization in generated images.

Contribution

GroundingBooth introduces a grounding module and a subject-grounded cross-attention layer for accurate spatial control, and it supports personalization with multiple subjects.

Findings

01. Achieves accurate spatial grounding in zero-shot settings.

02. Supports multiple subjects in personalized image synthesis.

03. Demonstrates strong results in layout-guided and customization tasks.

Abstract

Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
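One common way to quantify the "accurate layout alignment" the abstract mentions is intersection-over-union (IoU) between the box requested in the layout and the box where the object actually appears in the generated image. The sketch below is illustrative only: the abstract does not state which alignment metric the paper uses, and the `(x_min, y_min, x_max, y_max)` box convention and the names `Box` and `box_iou` are assumptions made for this example.

```python
from typing import Tuple

# Hypothetical convention: (x_min, y_min, x_max, y_max) in normalized image coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    requested = (0.1, 0.1, 0.5, 0.5)  # box supplied as layout input
    detected = (0.3, 0.1, 0.7, 0.5)   # box found in the output by some detector (assumed)
    print(f"IoU = {box_iou(requested, detected):.3f}")
```

An IoU of 1 means the generated object fills exactly the requested box, and 0 means the two boxes do not overlap at all.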

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The visualization results show that the proposed method effectively preserves the identity of the reference image while generating plausible images.
2. The proposed method can handle multiple objects simultaneously, even with complex layouts.

Weaknesses

1. My main concern is that the authors claim to ground the text entities during generation, yet the model's CLIP-T score indicates that its generated images are less coherent with the text than those of the baseline methods.
2. The paper claims to control the spatial relationship between objects, but this claim is difficult to evaluate because the layouts are predetermined.
3. How are the metrics computed? For example, when computing the CLIP-I score,
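For context on the metrics named in this review: CLIP-I and CLIP-T are commonly computed as cosine similarities between CLIP embeddings (image to image for CLIP-I, image to text prompt for CLIP-T), averaged over pairs. The sketch below shows only that arithmetic on precomputed embedding vectors. It is not the paper's evaluation protocol, which the excerpt above does not describe, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(
    generated: Sequence[Sequence[float]],
    references: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over all (generated, reference) pairs.

    With image embeddings on both sides this follows the usual CLIP-I recipe;
    with prompt embeddings as `references` it follows the usual CLIP-T recipe.
    The embeddings are assumed to come from a CLIP encoder run elsewhere.
    """
    sims = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(sims) / len(sims)
```

Whether the image embeddings are taken from the full image or from a cropped subject region changes what the score measures, which is one reason the computation details matter.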

Reviewer 02·Rating 6·Confidence 4

Strengths

* The paper is well written and nicely presented.
* The method improves over the baselines it does test (see first weakness).
* Such a model can be useful in many real-life applications.

Weaknesses

* The paper does not cover “Break-A-Scene: Extracting Multiple Concepts from a Single Image” by Avrahami et al. (2023). That work extracts concepts from an image using textual inversion and embeds them in new images. It also works with masks and can even accept them from the user as input. This is especially important since the second-to-last sentence of the abstract states “Our work is the first work to achieve a joint grounding of both subject-driven foreground generation and tex

Reviewer 03·Rating 6·Confidence 3

Strengths

Unlike many existing layout-guided image generation methods that handle only single subjects, GroundingBooth supports multi-subject customization. This versatility broadens its applicability, especially for generating images where complex layouts and multiple subjects are essential.

Weaknesses

1. InstanceDiffusion is absent from the baseline comparisons. Despite its clear relevance, with free-form language conditions per instance and flexible instance localization (single points, scribbles, and bounding boxes), it is missing from both the quantitative and qualitative baselines.
2. Unlike other works on similar tasks, this paper does not report FID.
3. Qualitative results demonstrating the model's performance on multi-subj

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: Focus