Forgedit: Text Guided Image Editing via Learning and Forgetting
Shiwen Zhang, Shuai Xiao, Weilin Huang

TL;DR
Forgedit is a novel text-guided image editing method that leverages a joint optimization framework, a vector projection mechanism, and a forgetting strategy to achieve fast, controllable, and high-quality edits surpassing previous state-of-the-art techniques.
Contribution
The paper introduces Forgedit, combining a rapid reconstruction framework, a novel vector projection in text embedding space, and a forgetting mechanism based on UNet properties to improve image editing capabilities.
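To make the two text-embedding operations more concrete, here is a minimal sketch, not the authors' implementation. It assumes that vector subtraction moves along the line (1-γ)e_src + γe_tgt mentioned in the reviews, and that vector projection splits e_tgt into a part parallel to e_src and an orthogonal "edit" part that are then rescaled independently. The function names, the weights `alpha` and `beta`, and the use of flat Python lists in place of real embedding tensors are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subtract_edit(e_src: Vector, e_tgt: Vector, gamma: float) -> Vector:
    """Vector subtraction: (1 - gamma) * e_src + gamma * e_tgt.

    gamma = 0 returns e_src, gamma = 1 returns e_tgt, and gamma > 1
    extrapolates past e_tgt along the direction e_tgt - e_src.
    """
    return [s + gamma * (t - s) for s, t in zip(e_src, e_tgt)]


def project_edit(e_src: Vector, e_tgt: Vector, alpha: float, beta: float) -> Vector:
    """Vector projection (hypothetical parameterization).

    e_tgt is decomposed as r * e_src + e_edit, where e_edit is orthogonal
    to e_src. alpha scales the source direction (identity similarity) and
    beta scales the orthogonal part (editing strength).
    """
    r = dot(e_src, e_tgt) / dot(e_src, e_src)
    e_edit = [t - r * s for s, t in zip(e_src, e_tgt)]
    return [alpha * s + beta * e for s, e in zip(e_src, e_edit)]
```

Under these assumptions, subtraction ties both effects to the single scalar γ, while projection exposes two knobs, which matches the claim that identity similarity and editing strength can be controlled separately.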
Findings
Achieves faster reconstruction than previous SOTA methods.
Controls identity similarity and editing strength separately.
Surpasses previous SOTA on TEdBench in CLIP and LPIPS scores.
Abstract
Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of the original image. In this paper, we design a novel text-guided image editing method, named Forgedit. First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds, much faster than previous SOTA and with much less overfitting. Then we propose a novel vector projection mechanism in the text embedding space of Diffusion Models, which is capable of controlling the identity similarity and editing strength separately. Finally, we discovered a general property of UNet in Diffusion Models, i.e.,…
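As a rough illustration of what a forgetting step could look like, the sketch below assumes it amounts to reverting a chosen subset of fine-tuned UNet parameters to their pretrained values at inference time (the reviews mention keeping and dropping different UNet weights). Which parameters to revert is not specified here, so the prefix-based selection, the function name `forget`, and the parameter names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, TypeVar

T = TypeVar("T")  # parameter value type, e.g. a tensor in a real model


def forget(
    finetuned: Dict[str, T],
    pretrained: Dict[str, T],
    forget_prefixes: Iterable[str],
) -> Dict[str, T]:
    """Return a parameter dict where selected weights are reverted.

    Any parameter whose name starts with one of forget_prefixes takes its
    pretrained value; every other parameter keeps its fine-tuned value.
    """
    prefixes = tuple(forget_prefixes)
    return {
        name: pretrained[name] if name.startswith(prefixes) else value
        for name, value in finetuned.items()
    }
```

For example, with hypothetical parameter names, `forget(finetuned, pretrained, ["down."])` would keep every fine-tuned weight except those whose names begin with `down.`, which fall back to the pretrained model.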
Peer Reviews
Decision·Submitted to ICLR 2024
1. The writing is clear and easy to follow. 2. To achieve the desired editing, the authors propose an adaptation of DreamBooth and also incorporate the optimization strategy from Imagic. To address potential overfitting arising from a single input image, a forgetting strategy is introduced. 3. The experiments provide evidence of the effectiveness of the proposed method, both in the context of rigid and non-rigid editing.
1. The training strategy of the proposed method is similar to Imagic, with the main differences being that the authors employ BLIP to generate a caption describing the input image, and combine the first and second stages in Imagic into one. Besides, the authors use DreamBooth as the backbone. 2. I find the location of the point (1-γ)e_src + γe_tgt in Figure 2 confusing, and I'm uncertain why the value of γ exceeds 1 in vector subtraction. Typically, γ should fall within the range [0,1] if n
1) The paper performs extensive explorations on diffusion-based image editing. The mechanisms the authors explore include the difference between vector subtraction and projection, and the changes brought by keeping and dropping different weights of the UNet. These explorations are meaningful and can provide insights to readers. 2) The paper is well-organized and easy to follow. 3) The proposed method achieves state-of-the-art performance on the image editing benchmark.
1) There are many components that must be adjusted at inference time. It is troublesome to adjust all these parameters manually. 2) For vector subtraction and vector projection, we need to decide which variant to use, and each variant has hyper-parameters that need to be determined. 3) For Fig. 5 and Fig. 6, it is hard to tell the settings of each column from the captions. 4) In Table 1, the quantitative results of other methods are missing.
This paper overall is clear and easy to follow.
1. Although the paper has presented convincing results on image editing problems with diffusion models, the bag of tricks is not new and just works as expected. 2. Vector subtraction has been widely used in generative image editing, in VAEs, GANs and diffusion models. 3. Vector projection is a kind of component analysis, which has been well studied in latent code manipulation in GANs. 4. Using a captioner to get the source prompt is straightforward, and usually it's not even required, since vi
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Diffusion · Contrastive Language-Image Pre-training
