LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps

Andrey Palaev; Adil Khan; Syed M. Ahsan Kazmi

arXiv:2501.14046 · cs.CV · January 27, 2025

Open Access · 1 Repo

TL;DR

This paper introduces a novel method that uses Large Language Models, cross-attention maps, and diffusion U-Net activations to enable precise, instance-level image manipulation based on textual prompts without requiring additional training or masks.

Contribution

The proposed pipeline allows for accurate, instance-level image editing guided by LLMs and cross-attention, eliminating the need for fine-tuning or auxiliary input masks.

Findings

01. Enables precise object manipulation without training or masks
02. Uses cross-attention maps for coherence in edits
03. Supports flexible, prompt-based image editing

Abstract

The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object…
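The abstract is cut off above and does not say how detections and cross-attention maps are combined. Purely as an illustration of one plausible ingredient, the sketch below thresholds a token's cross-attention map inside a detector bounding box to get a rough instance mask. The function name `instance_mask`, the box layout, the min-max normalisation, and the 0.5 threshold are all assumptions made for this example; none of them comes from the paper or its repository.

```python
from typing import List, Tuple

# Hypothetical box layout: (row_start, col_start, row_end, col_end), end-exclusive.
Box = Tuple[int, int, int, int]


def instance_mask(
    attn: List[List[float]], box: Box, threshold: float = 0.5
) -> List[List[bool]]:
    """Keep cells that lie inside the box and whose min-max normalised
    attention value reaches the threshold (an assumed heuristic)."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # guard against a constant attention map
    r0, c0, r1, c1 = box
    return [
        [
            (r0 <= r < r1) and (c0 <= c < c1) and ((v - lo) / span >= threshold)
            for c, v in enumerate(row)
        ]
        for r, row in enumerate(attn)
    ]


# Toy 4x4 attention map for one prompt token, plus a toy detector box.
attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.9],
    [0.0, 0.0, 0.1, 0.0],
]
box = (1, 1, 3, 3)
mask = instance_mask(attn, box)
```

In this toy case the strongly attended cell at row 2, column 3 is excluded because it falls outside the box, which is the kind of separation between neighbouring instances that a box constraint could provide.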

Code & Models

Repositories

palandr123/diffusionu-netllm · PyTorch · Official

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Vision and Imaging · Image and Object Detection Techniques

Methods: Max Pooling · Convolution · Concatenated Skip Connection · U-Net · Diffusion