Magic Insert: Style-Aware Drag-and-Drop
Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E. Jacobs, Shlomi Fruchter

TL;DR
Magic Insert introduces a style-aware drag-and-drop technique that inserts a subject from one image into a target image of a different style, restyling the subject to match the target, using diffusion models, CLIP, and domain adaptation.
Contribution
The paper formalizes style-aware object insertion and proposes a novel method combining diffusion fine-tuning, CLIP style infusion, and domain adaptation for realistic stylized image editing.
Findings
Outperforms traditional inpainting methods in style-aware insertion
Effectively personalizes subjects using LoRA and CLIP
Introduces a new dataset, SubjectPlop, for evaluation and future research
Abstract
We present Magic Insert, a method for dragging-and-dropping subjects from a user-provided image into a target image of a different style in a physically plausible manner while matching the style of the target image. This work formalizes the problem of style-aware drag-and-drop and presents a method for tackling it by addressing two sub-problems: style-aware personalization and realistic object insertion in stylized images. For style-aware personalization, our method first fine-tunes a pretrained text-to-image diffusion model using LoRA and learned text tokens on the subject image, and then infuses it with a CLIP representation of the target style. For object insertion, we use Bootstrapped Domain Adaption to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method significantly outperforms traditional approaches such as…
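The abstract's personalization step relies on LoRA fine-tuning. As a rough illustration of the general LoRA idea only (not the authors' implementation), the sketch below applies a low-rank update to a single frozen weight matrix with NumPy. The dimensions, `alpha`, and the helper name `lora_forward` are hypothetical choices for illustration; the learned text tokens and the CLIP style infusion are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 6, 2, 4.0  # hypothetical sizes, not from the paper

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-initialized


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))


x = rng.normal(size=d_in)

# With B = 0, the adapted layer reproduces the pretrained layer exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha, rank), W @ x)

# Stand-in for trained values: once B is nonzero, the low-rank update
# can be merged into a single weight matrix for inference.
B = rng.normal(size=(d_out, rank))
W_merged = W + (alpha / rank) * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(x, W, A, B, alpha, rank))

print("trainable parameters:", A.size + B.size, "of", W.size)
```

The point of the design is that only the two small matrices are trained, so a subject can be learned from very few images without modifying the pretrained weights.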
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The key contribution of this paper is the combination of object insertion with stylized, personalized reference-object generation. Building on these two well-studied problems, the proposed method tries to define a new setting for image editing, especially for image composition. In addition, the paper demonstrates a unified pipeline for solving the proposed problem. In particular, object insertion quality is greatly improved by the high-quality data obtained with the proposed filtering schemes. Experiments als
There are three concerns about the method: 1. The "magic insertion" has in fact been studied in a similar manner in previous work [1] and [2][3]. [1] may be concurrent work, but [2][3] have already shown image composition on stylized images. Their methods need more discussion. 2. The "Bootstrapped Domain Adaption" is very similar to object drop [4], just applied in reverse. The difference between these two works also needs to be discussed. 3. The identity sh
This work proposes to solve the problem of object insertion while keeping the subject's identity. Personalized image generation or composition has been widely studied in recent years. However, the task of inserting objects into the background with style harmonization is less explored. This work provides insight into how to adapt the object style to match the background style.
1. The biggest concern is identity preservation. This paper shows some results on simple objects, such as unreal 3D objects and objects without much texture. I would like to see more diverse objects, especially real-world objects with complex textures, scene text, and logos. Otherwise, the identity preservation of this work will not be convincing. 2. The comparison methods should include general object insertion or image composition methods, which do not exactly perform style matching but may
1. This paper aims to achieve style-aware drag-and-drop, which is an interesting and challenging problem. 2. This paper provides a new dataset consisting of subjects and backgrounds that span widely different styles and overall semantics. 3. The generated images of the proposed method seem plausible.
1. The so-called 'style-aware drag-and-drop' problem explored in this paper is quite similar to the earlier task of image composition/image blending, which entails seamlessly incorporating the given object into the specific visual context without altering the object’s appearance while ensuring natural transitions [1,2,3]. Therefore, I'm concerned that 'style-aware drag-and-drop' cannot be considered a new problem. Moreover, this paper lacks an introduction and comparison of these highly related
This paper introduces a two-part solution for style-aware drag-and-drop. The combination of style-aware personalization and domain-adapted insertion is interesting for applications requiring coherence between different visual styles. The proposed bootstrapped domain adaptation, which re-trains models using their own filtered outputs, shows practical effectiveness in enhancing insertion realism, with attention to shadows and reflections for a seamless integration. The SubjectPlop dataset provi
While the paper compares against baselines like StyleAlign and InstantStyle, the effectiveness of the comparisons could be improved. Clarifying the exact metrics used for these baseline models would strengthen the argument, as some performance differences (e.g., style fidelity vs. subject fidelity) are only discussed qualitatively. The method is very complex as the pipeline involves multiple complex steps, including LoRA training, CLIP embedding, adapter injection, and bootstrapped domain adapt
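One review above describes bootstrapped domain adaptation as re-training a model on its own filtered outputs. A toy sketch of that loop follows, under loud assumptions: the "model" is a one-dimensional Gaussian sampler standing in for an insertion model, the quality filter is a simple distance check, and "fine-tuning" is refitting the mean. The names `bootstrap_adapt`, `is_acceptable`, and `TARGET` are hypothetical, and nothing here reflects the paper's actual training procedure.

```python
import random
from typing import Callable


def bootstrap_adapt(
    mean: float,
    is_acceptable: Callable[[float], bool],
    rounds: int = 3,
    samples_per_round: int = 200,
    seed: int = 0,
) -> float:
    """Toy self-training loop: sample, filter, refit on the kept samples."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # 1. Run the current model on the new domain (here: draw samples).
        outputs = [rng.gauss(mean, 1.0) for _ in range(samples_per_round)]
        # 2. Keep only the outputs that pass the quality filter.
        kept = [y for y in outputs if is_acceptable(y)]
        if not kept:
            break  # nothing usable this round; stop instead of training on junk
        # 3. "Fine-tune" on the kept outputs (here: refit the mean).
        mean = sum(kept) / len(kept)
    return mean


TARGET = 2.0  # hypothetical "good output" in the new domain
adapted = bootstrap_adapt(0.0, lambda y: abs(y - TARGET) <= 1.0)
print(f"initial error: {abs(0.0 - TARGET):.2f}")
print(f"adapted error: {abs(adapted - TARGET):.2f}")
```

Because the model is only ever refit on samples that passed the filter, each round can pull it toward the target domain without any ground-truth data from that domain; the quality of the filter bounds how far this can go.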
Taxonomy
Topics: Artificial Intelligence in Games
Methods: Contrastive Language-Image Pre-training · Diffusion
