ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL
Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu

TL;DR
ReasonGen-R1 enhances autoregressive image generation by integrating chain-of-thought reasoning with supervised fine-tuning and reinforcement learning, leading to improved visual quality and controlled scene composition.
Contribution
It introduces a novel two-stage framework combining SFT and RL for reasoning in image generation, with a new dataset of rationales and a custom optimization algorithm.
Findings
Outperforms prior state-of-the-art models on multiple benchmarks.
Demonstrates improved control over object layouts and scene styles.
Achieves higher visual quality scores in evaluations.
Abstract
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning (SFT) on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on…
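The abstract names Group Relative Policy Optimization but does not spell out the update. As a rough illustration only, the sketch below shows the group-normalization step commonly associated with GRPO: several images are sampled for one prompt, each is scored by a judge, and each reward is normalized against its own group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the example scores are all assumptions made for illustration; this is not the authors' implementation and it omits their policy loss and any entropy term.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: standard GRPO-style normalization, not necessarily
    the exact variant used in the paper.
    """
    mu = mean(rewards)        # group baseline
    sigma = pstdev(rewards)   # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up binary judge scores for four images sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every sample receives the same score carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Under these assumptions, samples scored above the group mean get positive advantages and the rest get negative ones, while a group with identical scores yields all-zero advantages and therefore contributes no gradient for that prompt.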
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The method is well motivated. 2. Current results show that, of the methods considered here, the proposed method performs best. 3. The paper uses a simple, straightforward methodology.
1. Without standard large-scale benchmarks such as ImageNet, it is very hard to judge the quality of the model. 2. Although the authors evaluate on DPG-Bench, GenEval, and CompBench, these benchmarks carry inherent bias and noise: they rely on components such as object detectors, which can produce many false negatives and cannot detect classes beyond a fixed vocabulary. What steps were taken to make sure that this method is not affected by these biases? 3. The comparisons
1. The motivation is clear. 2. The combination of textual reasoning and image tokens is novel. 3. The ablation study is comprehensive.
1. The RL reward is provided by a single VLM judge (Qwen2.5-VL-7B). Is the policy overfitting to that judge? 2. The evaluation benchmarks (GenEval, DPG-Bench, T2I-Benchmark; Tables 1–3) mostly test object count, color binding, spatial relations, etc. What about a human-preference evaluation benchmark, for instance MM-RewardBench? 3. Human evaluation is missing. 4. Figure 4 shows that RL is unstable without the adaptive entropy loss. A theoretical justification could be provided. 5. The work does not achiev
1. The key contribution of this work lies in introducing the Chain-of-Thought (CoT) and Reinforcement Learning (RL) paradigms, which have proven effective in the LLM domain, into autoregressive image generation models. By enabling the model to generate a reasoning plan before creating an image, it effectively decomposes complex instruction-following tasks into manageable intermediate steps. 2. The two-stage training framework ensures that the model learns the correct reasoning structure and
1. The reward model (RM) is built upon Qwen-2.5-VL and provides binary scores. The current binary scoring can be quite extreme – minor deviations in text or image quality might result in a reward of 0, which could pose challenges for training. 2. The autoregressive generative model must generate an entire CoT text sequence during inference, which inevitably increases inference latency. Although performance is improved, the additional computational overhead presents a challenge for real-time
* The motivation is clear and easy to grasp, and the method demonstrates improved performance on tasks that standard image generators struggle with. * Good ablation study showing that SFT and the adaptive entropy loss matter and boost final performance. * Well-documented, transparent disclosure of training and data.
* Several highly related works, such as GoT-R1 and T2I-R1, all of which use chain-of-thought plus RL on AR image generation models, are not mentioned or compared against at all. * The abstract suggests the RL reward is mainly about “overall visual quality” as judged by a VLM, and a rather small one. This is a very high-level, coarse signal that is prone to hallucination and reward hacking. * The experiments seem mostly focused on compositional benchmarks; evaluation on broader text-to-image tasks (COCO et
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Games
Methods: Deterministic Policy Gradient
