GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan; Rongyao Fang; Yuqing Wang; Kun Wang; Linjiang Huang; Xingyu Zeng; Hongsheng Li; Xihui Liu

arXiv:2505.17022 · cs.CV · April 14, 2026

TL;DR

GoT-R1 improves visual generation from complex prompts by applying reinforcement learning to strengthen semantic and spatial reasoning, significantly advancing the state of the art on compositional image generation tasks.

Contribution

The paper introduces a reinforcement learning framework that enables visual generation models to autonomously discover reasoning strategies beyond predefined templates.

Findings

01. Significant improvements on T2I-CompBench benchmark.

02. Enhanced handling of complex spatial and attribute relationships.

03. Effective supervision via a dual-stage reward system.

Abstract

Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The…
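The dual-stage, multi-dimensional reward described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `StageScores` container, the dimension names, the equal stage weighting, and the group-wise normalization of rewards are all hypothetical. The sketch only shows the general idea of scoring both the reasoning process and the final image, then merging the scores into one scalar signal for reinforcement learning.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StageScores:
    """Hypothetical per-sample scores in [0, 1], e.g. returned by an MLLM judge."""
    reasoning: dict[str, float]  # scores for the reasoning process
    image: dict[str, float]      # scores for the final generated image


def combined_reward(s: StageScores, w_reasoning: float = 0.5) -> float:
    """Average each stage's dimensions, then mix the two stages.

    The 50/50 default weighting is an assumption, not a value from the paper.
    """
    r = mean(s.reasoning.values())
    i = mean(s.image.values())
    return w_reasoning * r + (1.0 - w_reasoning) * i


def group_advantages(rewards: list[float]) -> list[float]:
    """Centre and scale rewards across samples drawn for the same prompt.

    Group-wise normalization is one common choice in RL fine-tuning;
    it is assumed here rather than confirmed by the summary above.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All samples scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

In such a setup, each sampled reasoning chain and image pair for a prompt would be scored once, and samples with above-average combined reward would receive positive advantages during the policy update.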

Code & Models

Repositories

gogoduan/GoT-R1 (GitHub)
