RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection
Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou

TL;DR
RL-RIG introduces a reinforcement learning-based spatial reasoning framework for image generation, significantly improving spatial accuracy and structural integrity in generated images compared to existing models.
Contribution
The paper presents RL-RIG, a framework that follows a Generate-Reflect-Edit paradigm to bring Chain of Thought reasoning to spatial reasoning in image generation.
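The control flow of a generate-reflect-edit loop can be sketched as below. This is only an illustration of the paradigm, not the authors' implementation: the function names (`generate`, `check`, `propose_edit`, `edit`), the `max_rounds` cap, and the use of strings as stand-ins for images are all assumptions made for the example.

```python
from typing import Callable, Tuple


def generate_reflect_edit(
    prompt: str,
    generate: Callable[[str], str],                 # hypothetical: prompt -> image
    check: Callable[[str, str], Tuple[bool, str]],  # hypothetical: (prompt, image) -> (ok, feedback)
    propose_edit: Callable[[str, str], str],        # hypothetical: (prompt, feedback) -> edit instruction
    edit: Callable[[str, str], str],                # hypothetical: (image, instruction) -> edited image
    max_rounds: int = 3,                            # assumed cap on reflection rounds
) -> Tuple[str, int]:
    """Generate once, then alternate reflect and edit until the check passes."""
    image = generate(prompt)
    for round_idx in range(max_rounds):
        ok, feedback = check(prompt, image)
        if ok:
            return image, round_idx
        instruction = propose_edit(prompt, feedback)
        image = edit(image, instruction)
    return image, max_rounds


# Toy stand-ins: an "image" is just a text description of the scene.
prompt = "cat left of dog"
result = generate_reflect_edit(
    prompt,
    generate=lambda p: "cat right of dog",           # first draft gets the relation wrong
    check=lambda p, img: (img == p, f"expected: {p}"),
    propose_edit=lambda p, feedback: p,              # trivially ask for the prompt's layout
    edit=lambda img, instruction: instruction,       # trivially apply the instruction
)
print(result)
```

In the toy run the first draft fails the check, one edit is applied, and the second check passes, so the loop returns after a single reflection round.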
Findings
RL-RIG outperforms state-of-the-art models by up to 11% in spatial reasoning accuracy.
The paper evaluates spatial consistency with Scene Graph IoU and VLM-as-a-Judge.
The paper develops Reflection-GRPO to train the Actor and the Image Editor for better image quality.
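The summary does not define Scene Graph IoU, so the following is a minimal sketch under one plausible reading: each scene graph is treated as a set of (subject, relation, object) triples, and the score is the intersection over union of the two sets. The triple representation, the exact-match rule, and the function name `scene_graph_iou` are assumptions, and the paper's metric may differ.

```python
from typing import Iterable, Tuple

# Assumed representation: one spatial relation per (subject, relation, object) triple.
Triple = Tuple[str, str, str]


def scene_graph_iou(predicted: Iterable[Triple], target: Iterable[Triple]) -> float:
    """Intersection over union of two sets of relation triples."""
    pred_set, target_set = set(predicted), set(target)
    union = pred_set | target_set
    if not union:
        # Two empty graphs are treated as a perfect match (a convention chosen here).
        return 1.0
    return len(pred_set & target_set) / len(union)


target = [("cat", "left of", "dog"), ("lamp", "above", "table")]
predicted = [("cat", "left of", "dog"), ("lamp", "below", "table")]
score = scene_graph_iou(predicted, target)
print(round(score, 3))  # one shared triple out of three distinct triples
```

Exact matching on triples is strict: a single wrong relation both removes a hit from the intersection and adds an extra element to the union.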
Abstract
Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image…
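The abstract names Reflection-GRPO without giving details here. For orientation, the sketch below shows only the group-relative advantage used in standard GRPO, where each sampled output's reward is normalized by the mean and standard deviation of its group. How RL-RIG adapts this to reflection trajectories is not described in this summary; the reward values, the `eps` term, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled from one prompt.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Trajectories that score above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed as a baseline.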
