PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models
Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka

TL;DR
PartEdit introduces a novel method for fine-grained, part-specific image editing using pre-trained diffusion models by learning special textual tokens and localization masks, significantly improving editing precision and user preference.
Contribution
This work expands pre-trained diffusion models to understand and edit specific object parts, enabling more precise and user-preferred image edits.
Findings
Outperforms existing editing methods on all metrics
Achieves 66-90% user preference in studies
Establishes a new benchmark and evaluation protocol for part editing
Abstract
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalize on diffusion models' deep understanding of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region. Leveraging these masks, we design feature-blending and…
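The abstract mentions localization masks that restrict the editing region and a feature-blending step that uses them. As a rough illustration only, the sketch below shows one generic way a soft mask can blend two feature maps. The function names, the min-max normalization, and the array shapes are assumptions made for this example and are not taken from the paper, whose actual blending strategy is not given in the truncated text above.

```python
import numpy as np


def normalize_to_mask(attn, eps=1e-8):
    """Min-max normalize an attention-like map to [0, 1].

    Hypothetical helper: the paper's way of deriving masks is not shown here.
    """
    attn = np.asarray(attn, dtype=np.float64)
    lo, hi = attn.min(), attn.max()
    return (attn - lo) / (hi - lo + eps)


def blend_features(source_feats, edit_feats, mask):
    """Blend two (C, H, W) feature maps with a soft (H, W) mask.

    Where mask == 1 the edited features are used, where mask == 0 the
    source features are kept, and values in between mix the two linearly.
    """
    source_feats = np.asarray(source_feats, dtype=np.float64)
    edit_feats = np.asarray(edit_feats, dtype=np.float64)
    # Add a leading channel axis so the mask broadcasts over all channels.
    mask = np.clip(np.asarray(mask, dtype=np.float64), 0.0, 1.0)[None, :, :]
    return mask * edit_feats + (1.0 - mask) * source_feats


if __name__ == "__main__":
    source = np.zeros((3, 2, 2))   # stand-in for features of the original image
    edited = np.ones((3, 2, 2))    # stand-in for features of the edited image
    part_mask = normalize_to_mask([[0, 2], [4, 8]])
    print(blend_features(source, edited, part_mask)[0])
```

In this toy setup, pixels with a high mask value take their features from the edited branch while the rest of the image keeps the source features, which is the general idea behind restricting an edit to one object part.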
Peer Reviews
Decision: Submitted to ICLR 2025
(1) This paper focuses on an interesting question that is of great significance to downstream research and tasks. (2) The overall model design generally makes sense. (3) This paper is easy to follow.
(1) Some related works are missing. Discussing and comparing against these related works would improve the paper's quality. - [1] PnP Inversion: Boosting Diffusion-based Editing with 3 Lines of Code - [2] Inversion-Free Image Editing with Natural Language - [3] DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models (2) Can you provide more experimental results to demonstrate the effectiveness of the proposed method? For example, more comparison results with training-based editing
1. This paper addresses a critical problem in image editing: the inability to accurately edit specific parts of an object while keeping the rest of the image unchanged. 2. The use of token optimization to learn adaptable tokens for subsequent editing tasks is intuitive and intriguing. 3. The experiments are thorough, with comprehensive ablation studies that validate the effectiveness of the proposed approach. 4. The paper is well-written, easy to follow, and logically structured.
1. The images used for training part tokens are very limited, with only 10–20 images. In such cases, the representativeness of the images is crucial for generalization. The paper would be stronger if the authors conducted experiments showing how varying the types of training images affects the model's performance. 2. The method involves many hyperparameters that require tuning, including the number of diffusion timesteps for training part tokens and inference, the selection of layers
1. The paper presents a flexible method for text-based image editing focused on object parts, which is a novel contribution to the field of image processing and editing. 2. The paper is well-written and easy to follow. 3. The authors have conducted extensive experiments and provide a solid basis for its practical application.
1. The approach relies on a finite and manually defined set of part tokens, which could restrict the flexibility and applicability of the method in real-world scenarios where users might need to edit object parts that are not covered by the predefined tokens. This limitation could affect the generalizability of the technique to a broader range of editing tasks and objects. 2. There are many methods nowadays that utilize semantic segmentation to create masks, which are quite similar to this paper
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging
Methods: Diffusion
