TL;DR
GoT-R1 improves visual generation from complex prompts by applying reinforcement learning to strengthen semantic and spatial reasoning, significantly advancing the state of the art on compositional image generation tasks.
Contribution
GoT-R1 introduces a reinforcement learning framework that enables visual generation models to autonomously develop reasoning strategies beyond predefined templates.
Findings
Significant improvements on the T2I-CompBench benchmark.
Enhanced handling of complex spatial and attribute relationships.
Effective supervision of both the reasoning process and the final output via a dual-stage, MLLM-based reward system.
Abstract
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The…
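As a rough illustration of the dual-stage, multi-dimensional reward idea described in the abstract, the sketch below combines scores for the reasoning process with scores for the final image into a single scalar reward. Everything specific here is an assumption made for illustration and is not taken from the paper: the four score fields, the `StageScores` and `dual_stage_reward` names, the [0, 1] score range, and the weighted-average combination rule. In practice the scores would come from an MLLM judge; here they are plain numbers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StageScores:
    """Hypothetical per-sample scores in [0, 1], e.g. produced by an MLLM judge."""
    reasoning_semantic: float  # reasoning covers the prompt's objects and attributes
    reasoning_spatial: float   # planned layout matches the prompt's spatial relations
    image_prompt: float        # final image matches the prompt
    image_reasoning: float     # final image follows the reasoning chain


def dual_stage_reward(scores: StageScores, w_reasoning: float = 0.5) -> float:
    """Blend a reasoning-stage reward and an output-stage reward.

    The weighted average is an illustrative choice, not the paper's formula.
    """
    if not 0.0 <= w_reasoning <= 1.0:
        raise ValueError("w_reasoning must lie in [0, 1]")
    # Stage 1: average the dimensions that judge the reasoning process.
    reasoning = mean([scores.reasoning_semantic, scores.reasoning_spatial])
    # Stage 2: average the dimensions that judge the generated image.
    output = mean([scores.image_prompt, scores.image_reasoning])
    return w_reasoning * reasoning + (1.0 - w_reasoning) * output


example = StageScores(
    reasoning_semantic=1.0,
    reasoning_spatial=0.5,
    image_prompt=0.5,
    image_reasoning=0.0,
)
reward = dual_stage_reward(example, w_reasoning=0.5)
```

Keeping the two stages separate before blending makes it easy to see whether a low reward comes from poor reasoning or from an image that fails to follow otherwise sound reasoning.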
