R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
Jiayu Xiao, Henglei Lv, Liang Li, Shuhui Wang, Qingming Huang

TL;DR
This paper introduces a zero-shot grounded text-to-image generation method that uses region- and boundary-aware cross-attention guidance to improve layout accuracy and image fidelity without additional training, and reports better results than existing zero-shot grounded T2I methods.
Contribution
The paper proposes a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates attention maps during the diffusion process to enforce layout constraints, without training auxiliary modules or finetuning the diffusion model.
Findings
Outperforms state-of-the-art zero-shot grounded T2I methods.
Achieves high fidelity and layout accuracy in generated images.
Demonstrates significant improvements on multiple benchmarks.
Abstract
Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images from text prompts. However, these models often fail to convey the spatial composition specified by a layout instruction. In this work, we investigate zero-shot grounded T2I generation with diffusion models, that is, generating images that follow the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of the diffusion model during the generative process and helps the model synthesize images that (1) have high fidelity, (2) are highly compatible with the textual input, and (3) interpret layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention…
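The general idea of layout-driven attention guidance mentioned in the abstract can be illustrated with a small sketch. This is not the paper's R&B objective (the abstract is truncated before the method details); it is a generic, assumed energy that is low when a token's cross-attention mass falls inside its target box. The names `box_mask` and `region_energy`, the normalized `(x0, y0, x1, y1)` box format, and the plain nested-list attention map are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a generic region-style attention energy,
# NOT the R&B loss from the paper. All names here are hypothetical.

def box_mask(height, width, box):
    """Binary mask for a box (x0, y0, x1, y1) in normalized [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    c0, c1 = int(x0 * width), int(x1 * width)
    r0, r1 = int(y0 * height), int(y1 * height)
    return [
        [1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0 for c in range(width)]
        for r in range(height)
    ]


def region_energy(attn, mask):
    """Return 1 - (attention mass inside the box / total attention mass).

    0.0 means all attention lies inside the box; 1.0 means none does.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 1.0  # no attention anywhere: treat as fully misaligned
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / total


# Usage: a uniform 4x4 attention map against a box covering the top-left quarter.
uniform = [[1.0 / 16] * 4 for _ in range(4)]
mask = box_mask(4, 4, (0.0, 0.0, 0.5, 0.5))
energy = region_energy(uniform, mask)
```

In a training-free guidance setting, an energy of this kind would typically be differentiated with respect to the noisy latent at each denoising step and used to nudge the latent; that step needs an autodiff framework and a diffusion model, so it is left out of this sketch.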
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Attentive Walk-Aggregating Graph Neural Network · Diffusion
