RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji

TL;DR
RePrompt introduces a reinforcement learning-based framework that enhances text prompts for image generation by explicitly reasoning about visual semantics, leading to more accurate and faithful image synthesis.
Contribution
It is the first to incorporate explicit reasoning into prompt enhancement for text-to-image models using reinforcement learning, improving fidelity and compositionality.
Findings
Significantly improves spatial layout fidelity.
Enhances compositional generalization across models.
Achieves new state-of-the-art results on benchmarks.
Abstract
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic…
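To make "optimizing for image-level outcomes" concrete, the sketch below scores candidate enhanced prompts by the images they produce rather than by the prompt text. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`combined_reward`, `rank_prompts`), the equal weights, and the two-term reward (human preference plus semantic alignment with the original user prompt) are all hypothetical, and the excerpt above does not specify the full reward composition. The generator and the two scorers are passed in as callables, so no particular T2I model or reward model is assumed.

```python
from typing import Any, Callable, List, Sequence, Tuple


def combined_reward(preference: float, semantic: float,
                    w_pref: float = 0.5, w_sem: float = 0.5) -> float:
    """Weighted sum of two image-level scores (weights are an assumption)."""
    return w_pref * preference + w_sem * semantic


def rank_prompts(
    user_prompt: str,
    candidates: Sequence[str],
    generate: Callable[[str], Any],             # hypothetical T2I call: prompt -> image
    preference_fn: Callable[[Any], float],      # hypothetical human-preference scorer
    semantic_fn: Callable[[str, Any], float],   # hypothetical (user prompt, image) scorer
) -> List[Tuple[str, float]]:
    """Score each candidate prompt by the image it yields, best first."""
    scored = []
    for prompt in candidates:
        image = generate(prompt)
        # Semantic alignment is checked against the ORIGINAL user prompt,
        # so an enhanced prompt cannot score well by drifting off-topic.
        reward = combined_reward(preference_fn(image),
                                 semantic_fn(user_prompt, image))
        scored.append((prompt, reward))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In an RL setup, per-candidate rewards of this kind would serve as the training signal for the prompt-enhancing language model; because the reward depends only on the prompt and the resulting image, the same loop could in principle wrap different T2I backbones.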
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is clearly written and easy to follow. 2. Table 2 shows great transferability of the prompts among different text-to-image models. 3. The ablation study conducted in this paper is very thorough.
1. My main concern about this paper is regarding its novelty. The training procedure of their model is a very standard RL recipe for LLMs, and using an LLM as an automated prompt generator for text-to-image generation is not a new idea either (e.g. Hao et al. (2023); Mo et al. (2024); Yeh et al. (2024); Mañas et al. (2024); Yun et al. (2025); Cao et al. (2023); Qin et al. (2024); Yang et al. (2024d); Wu et al. (2024); Wang et al. (2024) in the paper). It seems like this paper would be better suite
1. The integration of explicit, structured reasoning with RL for prompt enhancement is a well-motivated approach. It effectively bridges the gap between linguistic fluency and visual plausibility that plagues LLM-based prompters. 2. The paper demonstrates consistent performance gains across three different diffusion-based T2I models (FLUX, SD3, Pixart-Σ). The improvements in challenging areas like spatial reasoning are compelling. 3. The framework is designed to be T2I model-agnostic, requiring
1. The training and evaluation prompts are heavily focused on object-centric, compositional generation (training prompts sourced from GenEval-like templates). This raises a concern about potential overfitting to the specific categories and styles of the benchmarks used. 2. It is unclear how RePrompt would perform on more diverse, stylized, imaginative, or long-form narrative prompts that are common in real-world use. 3. While the paper shows generalization across diffusion-based models, its
- The idea of combining LLM reasoning and image-level feedback is novel and promising. - The reward is also well designed. First, the visual-reasoning reward acts as a bridge connecting the image reward (human preference alignment) with semantic grounding (VLM reward). Second, it allows the reward to depend only on the behavior of the input and output, which makes RePrompt model-agnostic across different T2I backbones. - The ablations and theoretical analysis (in Appendix B), toge
- The evaluation benchmarks are only object-centric datasets. Performance on open-world prompts is not verified. It would be better to show several examples of this scenario. - No failure cases are shown in the visualizations. What is the model's behavior on rare, free-form prompts not covered in GenEval? For example, "a photo of a cat working in an office".
