Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel

TL;DR
This paper introduces a diffusion-based method that learns to add objects to images by reversing object removal: objects are first removed from natural images to build training pairs, and a model is then trained on the reverse direction, using a large-scale dataset and language models to support instruction-based editing without user masks.
Contribution
The authors propose a new approach that trains a diffusion model to add objects by learning from image pairs of object removal, utilizing large-scale datasets and language models for detailed instructions.
Findings
Outperforms existing models in object addition tasks
Uses natural images instead of synthetic data
Achieves better general editing results
Abstract
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to inpainting models that benefit from segmentation mask guidance. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic…
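The pair construction described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: `toy_remove_object` is a hypothetical stand-in for a mask-guided inpainting model, images are toy grayscale grids, and the `"add a {class}"` instruction template is an assumption made for illustration. It only shows the key idea that the object-removed image becomes the model input and the original natural image becomes the target.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities


@dataclass
class AdditionExample:
    source: Image      # object-removed image (model input)
    target: Image      # original natural image (model output)
    instruction: str   # textual instruction, e.g. "add a cup"


def toy_remove_object(image: Image, mask: Image) -> Image:
    """Hypothetical stand-in for a mask-guided inpainting model.

    Fills every masked pixel with the rounded mean of the unmasked pixels.
    """
    background = [
        pixel
        for row, mask_row in zip(image, mask)
        for pixel, masked in zip(row, mask_row)
        if not masked
    ]
    fill = round(sum(background) / len(background)) if background else 0
    return [
        [fill if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def make_addition_example(image: Image, mask: Image, class_name: str) -> AdditionExample:
    """Reverse the removal direction: removed image in, original image out."""
    removed = toy_remove_object(image, mask)
    # Assumed instruction template; the paper uses language models for richer text.
    return AdditionExample(source=removed, target=image, instruction=f"add a {class_name}")
```

Because the target is always the untouched original, the supervision signal in this setup is a natural image rather than a generated one, which matches the contrast with synthetic-target datasets drawn in the abstract.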
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. It introduces the Paint by Inpaint framework for image editing.
2. It constructs PIPE, a large-scale, high-quality, mask-free object addition image dataset guided by textual instructions.
3. Combining the Paint by Inpaint framework with the PIPE dataset improves performance when adding objects to images.
1. Motivation: Many existing methods add new objects under the guidance of external signals such as bounding boxes. This makes many attributes of the object controllable, such as its size and position. Does the proposed method offer any unique benefit over these existing methods?
2. The proposed method does not seem very practical. Many of the attributes of the generated new object cannot be specified, such as the
- This paper introduces a dataset which is automatically generated by Stable Diffusion. This data might help the community evaluate image editing methods.
- The filtering process might be useful. Researchers can use a similar filtering process to get cleaner data.
- This paper provides evaluations through both automatic metrics and a user study, validating that the proposed method outperforms baselines on the object addition task.
- The application scenarios of this method are too limited, as it only supports object addition. In contrast, the baselines mentioned in the paper, such as MagicBrush[1], can address various editing operations, including object addition, object replacement, object removal, action changes, and more. I suggest that the authors continue developing their method so that it can support at least 3-4 image editing operations.
- Object addition without requiring user-provided input masks is not a particu
1. This paper constructs a dataset comprising original images, images with certain objects removed, and corresponding editing instructions.
2. Utilizing this dataset, the paper trains a network specifically designed to add objects to images.
3. The article is well-written, with clear and precise explanations.
1. In some scenarios, this method may not be applicable. For instance, if there are three tables in the image and I want to place a cup on one specific table, this approach might not work effectively.
2. The comparison methods (such as SDEdit and Null-text inversion) were not specifically proposed for the task of adding objects through editing. Comparing with them may not be appropriate, and more suitable comparison methods should be included, such as BrushNet and other mask-based editing methods
The paper’s approach is reasonable, and it appears to outperform a diverse set of baselines on multiple datasets and across many metrics. In this sense, the comparisons shown in the paper are fairly comprehensive. There is also significant detail provided in the paper and the supplementary which should help facilitate reproducibility, and the authors also promise to release their dataset which could help future research on this task. Finally, the paper was relatively clear and easy to follow.
My primary concerns are twofold: First, the core idea of learning to add objects by constructing synthetic data using inpainting already exists in prior published work, including work which was published in conference proceedings before the ICLR deadline (e.g., ObjectDrop, Winter et al., ECCV 2024). I think the authors should, at the very least, tone down their focus on presenting this as a core contribution, and instead highlight other aspects of the work such as the importance of the dataset a
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Surveying and Cultural Heritage · Aesthetic Perception and Analysis
Methods: Inpainting · Diffusion
