Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

Hao Li

arXiv:2604.25314·cs.CV·April 29, 2026

TL;DR

Golden RPG introduces a region-aware noise predictor for compositional text-to-image generation, improving spatial coherence and prompt fidelity by dynamically conditioning on sub-prompts.

Contribution

Golden RPG extends a frozen NPNet noise predictor with region-specific conditioning (a per-region FiLM adapter and a Region Cross-Attention layer) and confidence-adaptive blending to improve multi-region image synthesis.

Findings

01 Achieves highest cross-region coherence scores across benchmarks.

02 Matches top baselines in CLIP-based metrics.

03 User study shows 67% preference for Golden RPG outputs.

Abstract

Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of…
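
To make the per-region FiLM idea in the abstract concrete, here is a minimal sketch of generic FiLM-style modulation (a scale and a shift) applied to a noise map inside region masks. It is not the paper's implementation: the function names `film_modulate` and `region_film`, the scalar `gamma`/`beta` per region, and the binary masks are all assumptions made for illustration. In FiLM generally, the scale and shift are predicted from a conditioning input (here that would be a sub-prompt embedding); this sketch takes them as given.

```python
from typing import List

Grid = List[List[float]]  # a single-channel 2D noise map (illustrative only)


def film_modulate(noise: Grid, mask: Grid, gamma: float, beta: float) -> Grid:
    """Apply the FiLM transform gamma * x + beta where mask == 1.

    Values outside the mask (mask == 0) are passed through unchanged.
    gamma and beta are hypothetical per-region scalars; a real adapter
    would predict them from the region's sub-prompt embedding.
    """
    out = []
    for noise_row, mask_row in zip(noise, mask):
        out.append([
            m * (gamma * x + beta) + (1.0 - m) * x
            for x, m in zip(noise_row, mask_row)
        ])
    return out


def region_film(noise: Grid, masks: List[Grid],
                gammas: List[float], betas: List[float]) -> Grid:
    """Modulate each region in turn; assumes the masks do not overlap."""
    for mask, gamma, beta in zip(masks, gammas, betas):
        noise = film_modulate(noise, mask, gamma, beta)
    return noise


# Two regions: left column and right column of a 2x2 noise map.
noise = [[1.0, 2.0], [3.0, 4.0]]
left = [[1.0, 0.0], [1.0, 0.0]]
right = [[0.0, 1.0], [0.0, 1.0]]
result = region_film(noise, [left, right], gammas=[2.0, 1.0], betas=[0.0, 10.0])
```

In this toy setup the left region is scaled by 2 and the right region is shifted by 10, so each sub-prompt's parameters affect only its own part of the noise map.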
