Control and Realism: Best of Both Worlds in Layout-to-Image without Training
Bonan Li, Yinhan Hu, Songhua Liu, Xinchao Wang

TL;DR
This paper introduces WinWinLay, a training-free method for layout-to-image generation that improves control precision and realism by addressing attention biases and out-of-distribution artifacts, outperforming existing methods.
Contribution
WinWinLay is a training-free approach that combines a Non-local Attention Energy Function with an Adaptive Update strategy to improve layout control and image realism in pre-trained Text-to-Image diffusion models.
Findings
Outperforms state-of-the-art methods in layout control accuracy
Achieves higher photorealism in generated images
Effectively reduces artifacts and localization errors
Abstract
Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute…
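To make the "commonly used attention energy function" mentioned in the abstract more concrete, here is a minimal sketch of one form that appears in earlier training-free layout guidance work: the squared shortfall of attention mass inside the target box. This is an illustrative assumption, not WinWinLay's non-local energy and not code from the paper. The function name `attention_energy`, the nested-list attention map, and the half-open box convention `(x0, y0, x1, y1)` are all hypothetical choices made for this example.

```python
def attention_energy(attn, box):
    """Squared shortfall of attention mass inside a layout box.

    attn: 2D list (rows x cols) of non-negative cross-attention values
          for a single text token (hypothetical input format).
    box:  (x0, y0, x1, y1) cell indices, half-open on the right/bottom
          (an assumed convention for this sketch).
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    # Attention mass that falls inside the box.
    inside = sum(sum(row[x0:x1]) for row in attn[y0:y1])
    # Energy is 0 when all mass is inside the box, 1 when none is.
    return (1.0 - inside / total) ** 2


# Uniform attention over a 4x4 map; the box covers the top-left 2x2 cells.
uniform = [[1.0] * 4 for _ in range(4)]
energy_uniform = attention_energy(uniform, (0, 0, 2, 2))

# All attention already concentrated inside the box.
focused = [[1.0 if (r < 2 and c < 2) else 0.0 for c in range(4)] for r in range(4)]
energy_focused = attention_energy(focused, (0, 0, 2, 2))
```

In guidance-style methods, the gradient of an energy like this with respect to the latent is used to nudge each denoising step. Note that this particular sketch depends only on the fraction of mass inside the box, not on how that mass is spread within it.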
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging
