Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation

Jiadong Pan; Zhiyuan Ma; Kaiyan Zhang; Ning Ding; Bowen Zhou

arXiv:2505.22407·cs.CV·May 29, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces SRRL, a self-reflective reinforcement learning algorithm for diffusion models that enhances logical image reasoning by iterative reflection, significantly improving generation quality in physics-adherent and unconventional scenarios.

Contribution

The paper presents the first integration of Chain of Thought reasoning with diffusion models using reinforcement learning for logical image generation.

Findings

01 SRRL outperforms existing methods in logical image reasoning tasks.

02 The approach enables reasoning in images adhering to physical laws.

03 Experimental results surpass GPT-4o in case studies.

Abstract

Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still struggle with image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models that achieves reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with a multi-round reflective denoising process and introduces a condition-guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based…
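The multi-round structure described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the scalar "image", the functions `denoise_trajectory`, `toy_reward`, and `renoise`, and the rule that maps a reward to a re-noising strength are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow, in which each full denoising pass acts as one step, the clean sample (not a noisy intermediate) is scored, and a forward re-noising step links consecutive rounds.

```python
import random


def denoise_trajectory(x_t, target, steps=20):
    """Toy stand-in for one full reverse (denoising) pass.

    A real system would run a diffusion sampler; here a scalar is simply
    pulled toward a target value.
    """
    x = x_t
    for _ in range(steps):
        x += 0.3 * (target - x)
    return x


def toy_reward(x0, target):
    """Toy stand-in for a reward model, applied to the clean sample only."""
    return -abs(x0 - target)


def renoise(x0, strength, rng):
    """Toy stand-in for a forward (noising) step between rounds."""
    return (1.0 - strength) * x0 + strength * rng.gauss(0.0, 1.0)


def reflective_generation(target, rounds=3, seed=0):
    """Run `rounds` full denoising passes with a reflection step in between."""
    rng = random.Random(seed)
    x_t = rng.gauss(0.0, 1.0)  # start from pure noise
    calls = {"denoise": 0, "reward": 0}
    samples = []
    for k in range(rounds):
        x0 = denoise_trajectory(x_t, target)
        calls["denoise"] += 1
        samples.append(x0)
        if k < rounds - 1:
            # Reflection: score the clean sample, then re-noise it.
            score = toy_reward(x0, target)
            calls["reward"] += 1
            # Assumption for illustration: worse scores get more noise.
            strength = min(0.9, 0.1 + abs(score))
            x_t = renoise(x0, strength, rng)
    return samples, calls


samples, calls = reflective_generation(target=2.0, rounds=3)
print(len(samples), calls)
```

In this sketch, `rounds` full denoising passes trigger `rounds - 1` reward evaluations, so the cost grows linearly with the number of reflection rounds.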

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

* The paper addresses a novel problem.
* The paper is clearly written and easy to follow.

Weaknesses

* Lack of analysis of computational overhead. Intuitively, even in the forward process, the model needs to perform k complete inferences and k-1 VQA model inferences, which is extremely expensive.
* Some images are difficult to understand. In Figure 5, both the downward and rightward directions are labeled *Round ↑*, which makes it difficult to understand the actual process. The figure also doesn't convey the conclusion being described.
* The specific value of K used in forward process is not clea

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Mapping CoT reasoning to an iterative, trajectory-wise refinement process for diffusion is a novel concept.
- Addresses the critical challenge of infusing generative models with reasoning.

Weaknesses

- Uses reward models (VQAScore, ImageReward) built for alignment, not for the target task of physical reasoning. The same reward models are also used for evaluation, which makes the results prone to reward hacking.
- The multi-round, re-noising (DDIM inversion) design is extremely costly.
- The model is evaluated on the same prompts it was trained on, raising concerns about overfitting vs. true generalization.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The topic of the paper is highly valuable. Progress in aligning text-to-image models with users' preferences has wide impact.
* The concept of using inversion in reward-model-guided generation is novel to the best of my knowledge.

Weaknesses

* The writing clarity could be improved. The main method was difficult to understand from the paper. The training objective should be stated explicitly and clearly. Including the training algorithm in the main paper (even in a more concise form) would greatly improve the readability of the method. Also, key pieces of information included in the appendix are not referred to in the text, for example the list of prompts used to create Tab. 1.
* The authors do not include related work on test-time

Reviewer 04 · Rating 2 · Confidence 3

Strengths

- The paper is well-written with a clear and logical structure.
- The work innovatively applies CoT to diffusion models to achieve reasoning generation.
- Quantitative results show notable improvements in metrics like CLIP Score, ImageReward, and VQAScore.

Weaknesses

- A critical limitation is that all employed reward models and metrics primarily focus on semantic alignment between images and prompts, lacking the ability to capture fine-grained differences in reasoning generation of logical images. In other words, while the paper proposes a novel image reasoning generation task with interesting cases, there is a lack of direct evidence that the framework actually enables the model to learn reasoning, rather than merely generating semantically consistent imag

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Evolutionary Algorithms and Applications · Model Reduction and Neural Networks