RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Mingrui Wu; Lu Wang; Pu Zhao; Fangkai Yang; Jianjin Zhang; Jianfeng Liu; Yuefeng Zhan; Weihao Han; Hao Sun; Jiayi Ji; Xiaoshuai Sun; Qingwei Lin; Weiwei Deng; Dongmei Zhang; Feng Sun; Qi Zhang; Rongrong Ji

arXiv:2505.17540·cs.CV·May 26, 2025

TL;DR

RePrompt introduces a reinforcement learning-based framework that enhances text prompts for image generation by explicitly reasoning about visual semantics, leading to more accurate and faithful image synthesis.

Contribution

RePrompt is presented as the first framework to incorporate explicit reasoning into prompt enhancement for text-to-image models using reinforcement learning, improving fidelity and compositionality.

Findings

01 Significantly improves spatial layout fidelity.

02 Enhances compositional generalization across models.

03 Achieves new state-of-the-art results on benchmarks.

Abstract

Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic…
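As a rough illustration of what "optimizing for image-level outcomes" could look like, the sketch below folds several per-image scores into one scalar reward and ranks candidate prompts by it. The helper names `combined_reward` and `rank_prompts`, the score keys, the equal weights, and all numbers are hypothetical; the abstract only says that reward models assess the generated images on human preference and further criteria, and it does not specify how those signals are combined.

```python
from typing import Dict, List


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion image scores (hypothetical helper)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def rank_prompts(candidates: Dict[str, Dict[str, float]],
                 weights: Dict[str, float]) -> List[str]:
    """Order candidate prompts from highest to lowest combined reward."""
    return sorted(
        candidates,
        key=lambda prompt: combined_reward(candidates[prompt], weights),
        reverse=True,
    )


# Illustrative numbers only; these are assumptions, not results from the paper.
weights = {"human_preference": 0.5, "semantic": 0.5}
candidates = {
    "a cat": {"human_preference": 0.4, "semantic": 0.6},
    "a gray cat sitting on a wooden desk, front view": {
        "human_preference": 0.7,
        "semantic": 0.9,
    },
}
ranking = rank_prompts(candidates, weights)
```

In a reinforcement learning setup, a scalar of this kind would serve as the training signal for the prompt-rewriting model; the weighting scheme here is a placeholder rather than the paper's design.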

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 4

Strengths

1. The paper is clearly written and easy to follow.
2. Table 2 shows great transferability of the prompts among different text-to-image models.
3. The ablation study conducted in this paper is very thorough.

Weaknesses

1. My main concern about this paper is regarding its novelty. The training procedure of their model is a very standard RL recipe for LLM, and using LLM as an automated prompt generator for text-to-image generation is not a new idea either (e.g. Hao et al. (2023); Mo et al. (2024); Yeh et al. (2024); Mañas et al. (2024); Yun et al. (2025); Cao et al. (2023); Qin et al. (2024); Yang et al. (2024d); Wu et al. (2024); Wang et al. (2024) in the paper). It seems like this paper would be better suite

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The integration of explicit, structured reasoning with RL for prompt enhancement is a well-motivated approach. It effectively bridges the gap between linguistic fluency and visual plausibility that plagues LLM-based prompters.
2. The paper demonstrates consistent performance gains across three different diffusion-based T2I models (FLUX, SD3, Pixart-Σ). The improvements in challenging areas like spatial reasoning are compelling.
3. The framework is designed to be T2I model-agnostic, requiring

Weaknesses

1. The training and evaluation prompts are heavily focused on object-centric, compositional generation (training prompts sourced from GenEval-like templates). This raises a concern about potential overfitting to the specific categories and styles of the benchmarks used.
2. It is unclear how RePrompt would perform on more diverse, stylized, imaginative, or long-form narrative prompts that are common in real-world use.
3. While the paper shows generalization across diffusion-based models, its

Reviewer 03·Rating 8·Confidence 3

Strengths

- The idea of combining LLM reasoning and image-level feedback is novel and promising.
- The reward is also well designed. First, the visual-reasoning reward acts as a bridge connecting the image reward (human preference alignment) with semantic grounding (VLM reward). Second, it allows the reward to depend only on the input and output behavior, which makes RePrompt model-agnostic across different T2I backbones.
- The ablations and theoretical analysis (in Appendix B), toge

Weaknesses

- The evaluation benchmarks are all object-centric datasets. Performance on open-world prompts is not verified. It would be better to show several examples of this scenario.
- There are no failure cases in the visualizations. How does the model behave on rare, free-form prompts not covered in GenEval? For example, "a photo of a cat working in an office"

Code & Models

Repositories

microsoft/dki_llm
PyTorch·Official
