GroundingBooth: Grounding Text-to-Image Customization
Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs

TL;DR
GroundingBooth is a novel method for zero-shot, instance-level spatial grounding in text-to-image customization, enabling precise layout control, identity preservation, and multi-subject personalization in generated images.
Contribution
The paper introduces a grounding module and a subject-grounded cross-attention layer for accurate spatial control, and it supports customization with multiple subjects.
Findings
Achieves accurate spatial grounding in zero-shot settings.
Supports multiple subjects in personalized image synthesis.
Demonstrates strong results in layout-guided and customization tasks.
Abstract
Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
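Layout alignment between a requested bounding box and the region where an object actually appears is often quantified with intersection-over-union (IoU). The sketch below is illustrative only and is not taken from the paper: the `Box` convention `(x_min, y_min, x_max, y_max)` and the function name `box_iou` are assumptions, and the abstract does not state which alignment metric the authors use.

```python
from typing import Tuple

# Hypothetical box convention (an assumption, not from the paper):
# (x_min, y_min, x_max, y_max) in any consistent unit.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0.
    return inter / union if union > 0 else 0.0
```

A value near 1 would indicate that a generated object closely fills its requested box, while a value near 0 would indicate that it landed elsewhere.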
Peer Reviews
Decision·Submitted to ICLR 2025
1. The visualization results show that the proposed method effectively preserves the identity of the reference image while generating plausible images.
2. The proposed method can handle multi-object synthesis, even with complex layouts.
1. My main concern is that the authors claim to ground the text entities during generation, while the model's CLIP-T score indicates that the generated images are less coherent with the text than those of other baseline methods.
2. The paper claims to control the spatial relationship between objects, but this claim is difficult to evaluate given that the layouts are pre-determined.
3. How are the metrics computed? For example, when computing the CLIP-I score,
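For context on the metrics the reviewer asks about: CLIP-I is commonly defined as the average cosine similarity between CLIP image embeddings of generated and reference images, and CLIP-T as the cosine similarity between image and text embeddings. The sketch below shows only that averaging step, under the assumption that the embeddings have already been extracted by some encoder. The function names are hypothetical, and this is not necessarily how the paper computes its scores.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(
    generated: Sequence[Sequence[float]],
    references: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over all (generated, reference) pairs.

    For a CLIP-I-style score, both arguments would hold image embeddings;
    for a CLIP-T-style score, `references` would hold text embeddings.
    """
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    if not scores:
        raise ValueError("need at least one generated and one reference embedding")
    return sum(scores) / len(scores)
```

Whether the similarity is computed on full images or on cropped subject regions changes what a CLIP-I-style score measures.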
* The paper is well written and nicely presented.
* The method improves over the baselines it does test (see first weakness).
* Such a model can be useful in many real-life applications.
* The paper does not cover “Break-A-Scene: Extracting Multiple Concepts from a Single Image” by Avrahami et al. (2023). In that work, the authors extract concepts from an image using textual inversion and use them to embed the concepts in new images. They also work with masks and can even accept them from the user as input. This is especially important since the second-to-last sentence of the abstract states “Our work is the first work to achieve a joint grounding of both subject-driven foreground generation and tex
Unlike many existing layout-guided image generation methods that handle only single subjects, GroundingBooth supports multi-subject customization. This versatility broadens its applicability, especially for generating images where complex layouts and multiple subjects are essential.
1. InstanceDiffusion is absent from the baseline comparisons. Despite its notable relevance, with support for free-form language conditions per instance and flexible instance localization methods (single points, scribbles, and bounding boxes), InstanceDiffusion is missing from both the quantitative and qualitative baselines.
2. FID, which other works on similar tasks report, is not reported in this paper.
3. Qualitative results demonstrating the model's performance on multi-subj
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Focus
