LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps
Andrey Palaev, Adil Khan, Syed M. Ahsan Kazmi

TL;DR
This paper introduces a pipeline that combines Large Language Models, open-vocabulary detectors, cross-attention maps, and intermediate diffusion U-Net activations to manipulate individual object instances from textual prompts, without requiring extensive training or input masks.
Contribution
The proposed pipeline allows for accurate, instance-level image editing guided by LLMs and cross-attention, eliminating the need for fine-tuning or auxiliary input masks.
Findings
Enables precise object manipulation without training or masks
Uses cross-attention maps for coherence in edits
Supports flexible, prompt-based image editing
Abstract
The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object…
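To make the idea of combining a detector with cross-attention more concrete, here is a minimal sketch of one way an instance mask could be derived. It is not the authors' implementation: the function name `instance_mask`, the normalized box format, the min-max normalization, and the 0.5 threshold are all assumptions made for illustration, and a real pipeline would operate on tensors taken from the U-Net rather than on nested lists.

```python
from math import ceil, floor


def instance_mask(attn, box, threshold=0.5):
    """Hypothetical helper: keep cells with high attention that lie inside a box.

    attn: 2D list of non-negative cross-attention values for one prompt token.
    box: (x0, y0, x1, y1) in normalized [0, 1] coordinates, e.g. from a detector.
    threshold: assumed cutoff applied after min-max normalization.
    """
    h, w = len(attn), len(attn[0])
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo

    # Convert the normalized box to cell indices on the attention grid.
    x0, y0, x1, y1 = box
    c0, c1 = floor(x0 * w), ceil(x1 * w)
    r0, r1 = floor(y0 * h), ceil(y1 * h)

    mask = []
    for r in range(h):
        row = []
        for c in range(w):
            # Min-max normalize so the threshold is independent of scale.
            norm = (attn[r][c] - lo) / span if span > 0 else 0.0
            inside = r0 <= r < r1 and c0 <= c < c1
            row.append(inside and norm >= threshold)
        mask.append(row)
    return mask


# Two regions attend strongly to the same word; the box selects one instance.
attn = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]
mask = instance_mask(attn, box=(0.0, 0.0, 0.5, 0.5))
```

The point of intersecting the two signals is that attention alone cannot separate two objects described by the same word, while a box alone is too coarse to follow an object's shape.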
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Vision and Imaging · Image and Object Detection Techniques
Methods: Max Pooling · Convolution · Concatenated Skip Connection · U-Net · Diffusion
