Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control
Denis Lukovnikov, Asja Fischer

TL;DR
This paper enhances ControlNet's ability to generate images from localized textual descriptions by modifying cross-attention scores during inference, enabling fine-grained control without additional training.
Contribution
It introduces a training-free cross-attention control method that improves layout-to-image generation with localized descriptions in ControlNet.
Findings
Improved control over image regions using localized descriptions
Reduction of concept bleeding and image degradation
Effective in challenging layout scenarios
Abstract
While text-to-image diffusion models can generate high-quality images from textual descriptions, they generally lack fine-grained control over the visual composition of the generated images. Some recent works tackle this problem by training the model to condition the generation process on additional input describing the desired image layout. Arguably the most popular among such methods, ControlNet, enables a high degree of control over the generated image using various types of conditioning inputs (e.g. segmentation maps). However, it still lacks the ability to take into account localized textual descriptions that indicate which image region is described by which phrase in the prompt. In this work, we show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions using a training-free approach that modifies the cross-attention scores during…
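As a rough illustration of what "modifying cross-attention scores" with a layout can look like, here is a minimal sketch in plain Python. It is not the authors' method: the function name `region_biased_attention`, the `penalty` parameter, and the rule of subtracting a large constant from scores between a pixel and tokens describing a different region are all assumptions made for this example. Tokens not tied to any region (marked `None`) are left untouched.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def region_biased_attention(scores, pixel_region, token_region, penalty=1e4):
    """Hypothetical helper, for illustration only.

    scores[i][j]    raw cross-attention score between pixel i and prompt token j
    pixel_region[i] id of the layout region that pixel i belongs to
    token_region[j] id of the region that token j describes, or None if the
                    token is part of the global (non-localized) prompt
    """
    weights = []
    for i, row in enumerate(scores):
        biased = []
        for j, s in enumerate(row):
            # Assumption: suppress tokens that describe a different region.
            mismatch = token_region[j] is not None and token_region[j] != pixel_region[i]
            biased.append(s - penalty if mismatch else s)
        weights.append(softmax(biased))
    return weights

# Toy example: two pixels, three tokens.
# Token 0 is global, token 1 describes region 0, token 2 describes region 1.
scores = [[1.0, 2.0, 0.5],
          [1.0, 2.0, 0.5]]
pixel_region = [0, 1]
token_region = [None, 0, 1]

weights = region_biased_attention(scores, pixel_region, token_region)
```

In this toy setup, each pixel places essentially zero attention weight on the token that describes the other region, while the global token still contributes everywhere. A real diffusion model would apply such a bias per attention head and per layer on tensors, and the actual control rule used in the paper may differ.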
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
Methods: Diffusion
