TL;DR
This paper introduces a new framework and benchmark for iterative, reflective visual generation, addressing limitations of single-pass models in handling complex prompts through multi-round reasoning and rectification.
Contribution
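It formalizes the Reason-Reflect-Rectify (R^3) loop for multi-round visual generation, introduces R^3-Bench for evaluating iterative reasoning and rectification, and proposes R^3-Refiner, a dual-stage framework built on Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM).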
Findings
State-of-the-art models can identify generation errors but fail to generate actionable rectification instructions.
R^3-Refiner improves scores by 12% in Reflective Verdict and 9% in Rectification.
The framework enhances the quality of visual generation across multiple models.
Abstract
Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification…
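The Reason-Reflect-Rectify loop named in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the function names (`r3_loop`, `generate`, `reflect`, `rectify`), the `Verdict` type, the `max_rounds` cap, and the mapping of each stage to one callable are all hypothetical, and images are stood in for by plain strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical result of the reflect step."""
    ok: bool                    # True if the image is judged to satisfy the prompt
    instruction: Optional[str]  # rectification instruction when ok is False


def r3_loop(
    prompt: str,
    generate: Callable[[str], str],
    reflect: Callable[[str, str], Verdict],
    rectify: Callable[[str, str], str],
    max_rounds: int = 3,  # assumed cap on rounds, not taken from the paper
) -> List[str]:
    """Run generate -> (reflect -> rectify)* and return every intermediate image."""
    image = generate(prompt)              # initial single-pass generation
    history = [image]
    for _ in range(max_rounds):
        verdict = reflect(prompt, image)  # check the image against the prompt
        if verdict.ok:
            break
        image = rectify(image, verdict.instruction)  # apply the instruction
        history.append(image)
    return history


# Toy stand-ins: an "image" is a string listing the concepts it depicts.
def toy_generate(prompt: str) -> str:
    return "a cat"  # deliberately ignores part of the prompt


def toy_reflect(prompt: str, image: str) -> Verdict:
    missing = [w for w in prompt.split() if w not in image.split()]
    if not missing:
        return Verdict(ok=True, instruction=None)
    return Verdict(ok=False, instruction=f"add {missing[0]}")


def toy_rectify(image: str, instruction: str) -> str:
    return image + " " + instruction.split(" ", 1)[1]


history = r3_loop("a red cat", toy_generate, toy_reflect, toy_rectify)
print(history)
```

In this toy setting the reflect step does two jobs at once: it issues a verdict and it writes the instruction. The gap reported on R^3-Bench concerns the second job, which is why the sketch keeps the instruction as an explicit field rather than folding it into the verdict.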
