TL;DR
DragonDiffusion introduces a novel image editing technique that enables precise, drag-style manipulations on diffusion models without fine-tuning, allowing for diverse edits like object movement and resizing using internal feature guidance.
Contribution
It presents a new method for image editing on diffusion models using feature correspondence guidance, without requiring fine-tuning or extra modules.
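To make "feature correspondence guidance" concrete, here is a minimal toy sketch. It rests on several assumptions: cosine similarity is used as the correspondence measure, the energy is 1 minus the mean similarity, and the features themselves are updated by finite differences. In an actual diffusion model the gradient would flow back to the latent through the network by automatic differentiation. The names correspondence_energy and guidance_step are hypothetical and do not come from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def correspondence_energy(gen_feats, ref_feats, src_idx, tgt_idx):
    """Assumed energy: 1 - mean cosine similarity between reference features
    at source positions and generated features at target positions."""
    sims = [cosine(gen_feats[t], ref_feats[s]) for s, t in zip(src_idx, tgt_idx)]
    return 1.0 - sum(sims) / len(sims)

def guidance_step(gen_feats, ref_feats, src_idx, tgt_idx, lr=0.5, eps=1e-4):
    """One descent step on the energy. Finite differences stand in for autograd,
    and the features stand in for the latent (both are simplifications)."""
    base = correspondence_energy(gen_feats, ref_feats, src_idx, tgt_idx)
    updated = [list(f) for f in gen_feats]
    for t in set(tgt_idx):
        for d in range(len(gen_feats[t])):
            bumped = [list(f) for f in gen_feats]
            bumped[t][d] += eps
            grad = (correspondence_energy(bumped, ref_feats, src_idx, tgt_idx) - base) / eps
            updated[t][d] = gen_feats[t][d] - lr * grad
    return updated

# Toy usage: drag the content at position 0 of the reference to position 2.
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gen = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9]]
before = correspondence_energy(gen, ref, [0], [2])
after = correspondence_energy(guidance_step(gen, ref, [0], [2]), ref, [0], [2])
print(before > after)
```

The editing signal here is just the pair of index lists: lowering the energy pulls the generated feature at the target position toward the reference feature at the source position, and positions outside the target are left untouched.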
Findings
Enables various editing modes such as object moving and resizing.
Does not require fine-tuning or additional modules.
Achieves content preservation and semantic alignment in edits.
Abstract
Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an…
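The cross-branch self-attention mentioned in the abstract can be illustrated with a small sketch. It assumes that queries come from the editing branch while keys and values come from the branch that reconstructs the original image; this summary does not spell out the exact wiring, so treat that as an assumption. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def cross_branch_attention(edit_queries, ref_keys, ref_values):
    """Assumed wiring: the editing branch asks the questions (queries), and the
    original-image branch supplies the content (keys and values)."""
    return attention(edit_queries, ref_keys, ref_values)

# Toy usage: each output row is a weighted mix of the reference values.
ref_k = [[1.0, 0.0], [0.0, 1.0]]
ref_v = [[1.0, 0.0], [0.0, 1.0]]
mixed = cross_branch_attention([[10.0, 0.0]], ref_k, ref_v)
print(mixed[0][0] > mixed[0][1])
```

Because every output is a convex combination of values taken from the original-image branch, the edited result can only reuse content that exists in the original, which is one way to read the consistency claim.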
Peer Reviews
Decision: ICLR 2024 spotlight
- While gradient guidance has been explored extensively, using this idea as a general approach to accomplish multiple image editing tasks is cool. The qualitative results shown in Figure 12 are stunning, and all of this is achieved without any auxiliary model.
- Caching the memory bank for improved image information preservation is a useful technique.
- Reporting the detailed model inference time, as in Table 1, is welcome.
- The proposed method works well for real ima
- First of all, the paper proposes gradient guidance sampling for a range of tasks, but the writing and the experiments mainly focus on dragging-based editing. This narrows the scope of the paper considerably. It is suggested to formulate the paper as a general solution and treat the multiple tasks equally.
- Also, the experiments are not thorough enough. It is suggested to conduct comparisons on other tasks besides dragging edits. For example, for object pasting, it is suggested
* The task of user-defined handles is challenging and well-motivated, supported by various applications shown in the paper.
* Evaluations were done with reasonable metrics and against SOTA methods, and decent improvements can be observed, especially in efficiency compared with DragDiffusion. Nice qualitative results are shown.
* The method has significantly less complexity compared with prior works, but seems to work well.
* Compared with prior (and concurrent) works such as DragGAN and DragDiffusion, far too few samples are shown. The paper and supplementary do not present enough challenging and diverse qualitative samples and comparisons.
* The ablation is a bit incomplete. For example, it would be nice to see some ablations on the usefulness of S_global.
* Some flickering still happens in the no-change areas, e.g. clouds in the sun example and background in the apple example. If this is because of the balance between di
- The energy motivation that originates from classifier guidance is interesting. It motivates the design of the energy function for correspondence in diffusion models.
- The visualization figure vividly demonstrates the editing effect.
- How the memory bank contributes is not made clear in this draft. As the memory bank is proposed as a contribution, the authors should provide a more comprehensive ablation study, including both quantitative and qualitative analysis.
- How the energy design makes the method work is not clear; the authors should provide more detailed numerical studies.
- The inference time is too slow, approximately 15.93 in Table 1, which makes the solution incomparable with DragGAN.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Diffusion
