R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Jiayu Xiao; Henglei Lv; Liang Li; Shuhui Wang; Qingming Huang

arXiv:2310.08872 · cs.CV · November 28, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a zero-shot grounded text-to-image generation method that uses region and boundary aware guidance to improve spatial accuracy and fidelity without additional training, significantly outperforming existing methods.

Contribution

The paper proposes a novel R&B-aware cross-attention guidance approach that modulates attention maps during diffusion to incorporate layout constraints without training auxiliary modules.
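
To make the idea of layout-constrained attention concrete, here is a minimal sketch of a generic region-style energy on a single cross-attention map. It is an illustration under stated assumptions, not the paper's formulation: the names `box_mask` and `region_energy`, the normalized `(x0, y0, x1, y1)` box format, and the energy definition are all hypothetical choices made for this example.

```python
import numpy as np

def box_mask(box, height, width):
    """Binary mask for a normalized (x0, y0, x1, y1) box on a height x width grid.

    The box format is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    r0, r1 = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
    c0, c1 = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask

def region_energy(attn, mask, eps=1e-8):
    """Fraction of attention mass that falls outside the box.

    Returns a value near 0 when the token's attention lies entirely inside
    the box and near 1 when it lies entirely outside. This is a generic
    illustrative energy, not the R&B loss from the paper.
    """
    inside = float((attn * mask).sum())
    total = float(attn.sum())
    return 1.0 - inside / (total + eps)

# Toy usage: a 4x4 attention map for one text token, box = left half of the image.
mask = box_mask((0.0, 0.0, 0.5, 1.0), 4, 4)
uniform_attn = np.full((4, 4), 1.0 / 16.0)
print(region_energy(uniform_attn, mask))
```

In a training-free guidance setting, an energy of this kind would typically be differentiated with respect to the noisy latent using an autodiff framework, and the latent nudged to lower it at each denoising step; that step is omitted here because it depends on the specific diffusion model.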

Findings

01. Outperforms state-of-the-art zero-shot grounded T2I methods.

02. Achieves high fidelity and layout accuracy in generated images.

03. Demonstrates significant improvements on multiple benchmarks.

Abstract

Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text prompts as input. However, these models fail to convey the appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of the diffusion model during the generative process, and assists the model in synthesizing images that (1) have high fidelity, (2) are highly compatible with the textual input, and (3) interpret layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Attentive Walk-Aggregating Graph Neural Network · Diffusion