Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment
Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu

TL;DR
This paper introduces a novel image editing framework that fuses visual references and text guidance within a pre-trained diffusion model, achieving high-quality, semantically consistent edits with intuitive control.
Contribution
It presents a new fusion approach that integrates visual and textual prompts into a frozen diffusion model using minimal neural network components, enhancing control and image quality.
Findings
Produces higher quality images than state-of-the-art methods
Ensures semantic consistency and realistic editing effects
Works effectively across various benchmark datasets
Abstract
The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
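The abstract gives no architectural details, so the following is only a toy sketch of the general idea it describes: a small trainable network reads a latent feature from a frozen model together with a text embedding and a visual-reference embedding, and outputs a shift that is added to that latent. Every name here (`TinyFusionNet`, `edit_latent`, `strength`), the layer sizes, and the additive form of the edit are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng, scale=0.1):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TinyFusionNet:
    """Hypothetical two-layer MLP: (latent, text, visual) -> latent shift."""

    def __init__(self, latent_dim, text_dim, visual_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        in_dim = latent_dim + text_dim + visual_dim
        self.w1, self.b1 = init_layer(in_dim, hidden_dim, rng)
        self.w2, self.b2 = init_layer(hidden_dim, latent_dim, rng)

    def __call__(self, latent, text_emb, visual_emb):
        # Fuse the three inputs by concatenation (an assumed fusion rule).
        fused = list(latent) + list(text_emb) + list(visual_emb)
        hidden = [math.tanh(v) for v in linear(fused, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def edit_latent(latent, text_emb, visual_emb, net, strength=1.0):
    """Return a shifted copy of the latent; the input latent is not modified."""
    shift = net(latent, text_emb, visual_emb)
    return [h + strength * d for h, d in zip(latent, shift)]


# Toy usage with made-up vectors standing in for real features and embeddings.
net = TinyFusionNet(latent_dim=4, text_dim=3, visual_dim=3)
latent = [0.5, -0.2, 0.1, 0.9]
text_emb = [1.0, 0.0, 0.0]
visual_emb = [0.0, 1.0, 0.0]
edited = edit_latent(latent, text_emb, visual_emb, net)
unchanged = edit_latent(latent, text_emb, visual_emb, net, strength=0.0)
```

In this sketch only `TinyFusionNet` holds parameters that would be trained; the model producing `latent` is treated as fixed, which mirrors the "frozen" setting in the abstract. Setting `strength=0.0` returns the original latent, so the edit can be scaled continuously.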
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Image Processing Techniques
Methods: Diffusion
