Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

TL;DR
Diffree is a novel diffusion-based model that enables seamless, text-guided addition of objects into images without requiring bounding boxes or masks, maintaining visual consistency and relevance.
Contribution
We introduce Diffree, a text-guided object addition method trained on a new synthetic dataset, which predicts object placement and integrates objects seamlessly using only text prompts.
Findings
High success rate in object addition
Maintains background and spatial consistency
Outperforms existing methods in relevance and quality
Abstract
This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object…
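As a rough illustration of how such a tuple might be used, the sketch below models three of the elements the abstract names (original image, object-removed image, object mask) and pairs them for training. The field names, the toy list-of-lists image type, and the training direction (object-removed image plus text as input, original image plus mask as target) are assumptions made for illustration, not the authors' code. The fourth tuple element is truncated in the text above and is left out; the prompt format "add {object}" is taken from the reviewer comments.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


@dataclass
class OABenchSample:
    """Three tuple elements named in the abstract; field names are hypothetical."""
    original: Image  # real image that contains the object
    removed: Image   # inpainted image with the object removed
    mask: Image      # binary object mask, 1 where the object is


def make_prompt(object_name: str) -> str:
    # Reviewers note that prompts take the form "add {object}".
    return f"add {object_name}"


def to_training_pair(sample: OABenchSample, object_name: str):
    """Assumed direction: the model sees the object-free image plus text
    and is asked to recover the original image and the object mask."""
    inputs = {"image": sample.removed, "text": make_prompt(object_name)}
    targets = {"image": sample.original, "mask": sample.mask}
    return inputs, targets


def background_unchanged(sample: OABenchSample) -> bool:
    """True if the two images agree at every pixel outside the mask."""
    return all(
        o == r
        for o_row, r_row, m_row in zip(sample.original, sample.removed, sample.mask)
        for o, r, m in zip(o_row, r_row, m_row)
        if m == 0
    )


# Tiny 2x3 example: one object pixel (value 9) at row 1, column 1.
sample = OABenchSample(
    original=[[5, 5, 5], [5, 9, 5]],
    removed=[[5, 5, 5], [5, 5, 5]],
    mask=[[0, 0, 0], [0, 1, 0]],
)
inputs, targets = to_training_pair(sample, "dog")
```

The `background_unchanged` helper is one simple way to express the background-consistency requirement the abstract raises: pixels outside the object mask should be identical before and after the edit.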
Peer Reviews
Decision · Submitted to ICLR 2025
- Object inpainting using only text without relying on shape constraints.
- The authors built an OABench to facilitate text-guided object inpainting.
- The results are attractive and the supported applications are interesting.

- Lack of ablation of the validation model design. For example, what happens to the output if the OMP is removed? BTW, integrating a mask head in the diffusion process is not new, e.g. in [1].
- The comparison is slightly unfair. In Figure A10, comparing the model you trained on the curated dataset to other methods that were not retrained or fine-tuned is unfair.
- This paper is poorly written, especially the explanation of the charts. For example, in line 50 of the description section, the auth

1. Diffree offers a user-friendly approach to inserting objects into images. The mask-free object insertion is particularly useful in practical applications.
2. The creation of OABench, a large-scale synthetic dataset, is a significant contribution, providing a rich resource for training and evaluating object addition models.
3. The OMP module's ability to predict the target mask and generate inpainting results simultaneously is a novel architectural advancement in this field.

1. [1] proposed a method for mask prediction closely related to Diffree. An in-depth analysis and comparison with this work would be beneficial.
2. All the prompts used in this paper are in the form "add {object}". It is unclear how Diffree generalizes to more precise control, such as "add a dragon in the room".
3. While user-friendly for object insertion, it restricts users from adjusting the mask. In standard image processing, users or designers often need to make adjustments to achieve their

- **Originality**: Diffree’s approach of shape-free object addition guided solely by text is novel, significantly enhancing usability by eliminating the need for manual mask definitions. This innovation in user experience represents a unique contribution.
- **Clarity**: Overall, the dataset creation process and evaluation metrics are well-described, with figures that aid understanding of Diffree’s operational and comparative performance.

- **Minor Typographical Issue**: There is a missing period at the end of line 101, which should be corrected for clarity.
- **Dataset Limitation in Prompt Detail**: Since the dataset primarily relies on the COCO dataset, prompts are often generic object labels rather than detailed, fine-grained descriptions. This limitation can hinder the model's ability to respond to nuanced or interactive prompts, such as requests for specific object attributes or context-based interactions.
- **Methodology Cl
Taxonomy
Methods: Diffusion · Inpainting
