Detector Guidance for Multi-Object Text-to-Image Generation
Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, and Zhou Zhao

TL;DR
This paper introduces Detector Guidance, a method that uses a latent object detection model to improve multi-object text-to-image generation by reducing object mixing and enhancing object separation.
Contribution
The paper proposes a novel Detector Guidance approach that integrates latent object detection into diffusion models to better handle multiple objects in generated images.
Findings
DG improves object separation in generated images.
Human evaluations show 8-22% reduction in object mixing.
DG outperforms baseline methods on COCO, CC, and MRO benchmarks.
Abstract
Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their capability to generate images with text containing multiple objects is still restricted. Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment. We find that mixing problems also occur on the image side and in the cross-attention blocks. The noisy images can cause different objects to appear similar, and the cross-attention blocks inject information at a pixel level, leading to leakage of global object understanding and resulting in object mixing. In this paper, we introduce Detector Guidance (DG), which integrates a latent object detection model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
MethodsGated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Refunds@Expedia|||How do I get a full refund from Expedia? · Dense Connections · SentencePiece · Adafactor · Residual Connection
