Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation
Hao Li

TL;DR
Golden RPG introduces a region-aware noise predictor for compositional text-to-image generation, improving spatial coherence and prompt fidelity by dynamically conditioning on sub-prompts.
Contribution
It extends diffusion models with region-specific conditioning mechanisms and confidence-adaptive blending to improve multi-region image synthesis.
Findings
Achieves the highest cross-region coherence scores across benchmarks.
Matches top baselines in CLIP-based metrics.
User study shows 67% preference for Golden RPG outputs.
Abstract
Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of…
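To make the per-region FiLM idea concrete, here is a minimal NumPy sketch of feature-wise linear modulation applied to a noise tensor region by region. It is an illustration under stated assumptions, not the paper's implementation: the function name `film_region_noise`, the linear maps `w_gamma` and `w_beta`, the binary region masks, and the mask-normalised blending are all hypothetical. It does not cover the Region Cross-Attention layer or the confidence-adaptive blending.

```python
import numpy as np

def film_region_noise(noise, masks, sub_embs, w_gamma, w_beta):
    """Hypothetical per-region FiLM modulation of a noise tensor.

    noise:    (C, H, W) globally predicted noise
    masks:    (R, H, W) one spatial mask per region (assumed input)
    sub_embs: (R, D)    one embedding per sub-prompt (assumed input)
    w_gamma, w_beta: (D, C) linear maps standing in for a trained adapter
    """
    # FiLM parameters per region and channel; gamma is centred on 1 so that
    # zero weights leave the noise unchanged.
    gamma = 1.0 + sub_embs @ w_gamma          # (R, C)
    beta = sub_embs @ w_beta                  # (R, C)

    # Scale and shift the same global noise once per region: (R, C, H, W).
    modulated = gamma[:, :, None, None] * noise[None] + beta[:, :, None, None]

    # Normalise masks so overlapping regions average their contributions.
    coverage = masks.sum(axis=0)                       # (H, W)
    weights = masks / np.maximum(coverage, 1e-8)       # (R, H, W)
    blended = (weights[:, None] * modulated).sum(axis=0)  # (C, H, W)

    # Pixels not covered by any region keep the original global noise.
    return np.where((coverage > 0)[None], blended, noise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 8, 8))
    masks = np.zeros((2, 8, 8))
    masks[0, :, :4] = 1.0   # left half belongs to region 0
    masks[1, :, 4:] = 1.0   # right half belongs to region 1
    sub_embs = rng.standard_normal((2, 16))
    w_gamma = 0.1 * rng.standard_normal((16, 4))
    w_beta = 0.1 * rng.standard_normal((16, 4))
    out = film_region_noise(noise, masks, sub_embs, w_gamma, w_beta)
    print(out.shape)
```

Centring `gamma` on 1 is a design choice in this sketch: an untrained adapter with zero weights then behaves as the identity, so the region-aware path starts from the global noise instead of replacing it.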
