Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing
Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade

TL;DR
Image-POSER introduces a reinforcement learning framework that dynamically orchestrates multiple image generation and editing experts to handle complex, long-form prompts, improving alignment, fidelity, and aesthetics in AI-generated images.
Contribution
The paper presents a novel reflective RL approach that manages diverse pretrained models for complex image synthesis and editing tasks, enabling end-to-end long-form prompt handling.
Findings
Outperforms baseline models in alignment, fidelity, and aesthetics
Achieves higher human preference scores
Demonstrates effective long-form prompt decomposition and expert pipeline learning
Abstract
Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred…
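To make the Markov Decision Process framing concrete, here is a minimal sketch, not the authors' implementation. It assumes a state made of the current instruction, the remaining task list, and the current image; an action that picks one expert from a registry; and a reward supplied by a critic. The expert names (`t2i_expert`, `edit_expert`), the `toy_critic` rule, the reward values, and the list-of-strings image representation are all hypothetical placeholders. The sketch also omits any retry or re-planning behaviour.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for pretrained experts: each maps (image, instruction) -> image.
# An "image" is just a log of applied operations, purely for illustration.
Image = List[str]
Expert = Callable[[Image, str], Image]


def make_expert(name: str) -> Expert:
    def run(image: Image, instruction: str) -> Image:
        return image + [f"{name}:{instruction}"]
    return run


REGISTRY: Dict[str, Expert] = {
    "t2i_expert": make_expert("t2i_expert"),    # text-to-image (assumed name)
    "edit_expert": make_expert("edit_expert"),  # image-to-image (assumed name)
}


@dataclass
class State:
    instruction: str      # sub-task to execute now
    remaining: List[str]  # sub-tasks still pending
    image: Image          # current canvas


def toy_critic(state: State, action: str) -> float:
    """Placeholder for a VLM critic: prefers generating first, editing afterwards."""
    wanted = "t2i_expert" if not state.image else "edit_expert"
    return 1.0 if action == wanted else -1.0


def step(state: State, action: str) -> Tuple[State, float, bool]:
    """One MDP transition: score the choice, apply the expert, advance the task list."""
    reward = toy_critic(state, action)
    image = REGISTRY[action](state.image, state.instruction)
    if not state.remaining:
        return State("", [], image), reward, True
    return State(state.remaining[0], state.remaining[1:], image), reward, False


def rollout(tasks: List[str], policy: Callable[[State], str]) -> Tuple[Image, float]:
    """Run one episode over a decomposed prompt and return the final image and return."""
    state, total, done = State(tasks[0], tasks[1:], []), 0.0, False
    while not done:
        state, reward, done = step(state, policy(state))
        total += reward
    return state.image, total


def greedy_policy(state: State) -> str:
    """Hand-written policy standing in for a learned one."""
    return "t2i_expert" if not state.image else "edit_expert"


if __name__ == "__main__":
    print(rollout(["draw a red barn", "add snow", "make it dusk"], greedy_policy))
```

In this toy setting the hand-written policy already matches the critic; the point of learning a policy is that, with real experts, the best choice per sub-task is not known in advance.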
Peer Reviews
Decision · Submitted to ICLR 2026
The work proposes a novel formulation of image generation and editing. Unlike traditional approaches that predefine a sequence of subtasks, it formulates image generation as a sequential decision-making problem. This formulation allows dynamic and adaptive orchestration through a reflective reinforcement learning environment, in which commands are created incrementally and their outputs validated. By orchestrating multiple pretr
The orchestration depends on pretrained models, which risks introducing inconsistencies, redundancies, or conflicts. How does the system deal with these? A further drawback of the proposed approach is its high computational cost: the reflective steps introduce significant latency, which poses challenges for practical deployment and real-time applications.
1. The paper addresses an important challenge in text-to-image generation: following long, compositional prompts with sequential edits. Casting this as an RL task that orchestrates multiple specialized models is an interesting angle.
2. The approach leverages existing high-quality models (e.g., diffusion models, editors) without retraining them. As noted by the authors, the only learnable part is a lightweight DQN with a 3-layer MLP, making the system relatively plug-and-play.
3. According to
1. Limited analysis of VLM choice. The paper lacks experiments on selecting the VLM as a judge or a critic.
2. Missing efficiency analysis. The method involves multiple iterative steps, which may be less efficient than alternatives. Please provide a clear efficiency study compared with baselines under matched settings, and analyze scaling.
3. The DQN’s state is just the current instruction and the remaining task list embedded as text, which means the agent has very limited information. In esse
* The method trains only a lightweight DQN controller and keeps all experts fixed, requiring no retraining of large vision–language models. This modular structure makes it cost-efficient, flexible, and potentially useful as a plug-in layer for real-world multimodal systems.
* The proposed Image-POSER system consistently outperforms both open-source and closed commercial models (e.g., GPT-Image-1, Gemini) on compositional and instruction-following tasks.
* The reliance on GPT-o3 as both the critic in training and the automatic evaluator introduces potential bias and circularity. Since the same model family judges the system it helps train, it is unclear whether improvements reflect genuine quality gains or alignment with that evaluator’s preferences.
* The experiments are conducted on a relatively small computational scale, using short RL episodes and a limited replay buffer on a single T4 GPU. This raises concerns about policy stability, gener
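The reviews above describe the only learnable component as a lightweight DQN with a 3-layer MLP whose state is the text-embedded current instruction and remaining task list. As a rough illustration only, the sketch below shows what such a controller's forward pass, epsilon-greedy expert selection, and one-step TD target could look like in plain Python. The layer sizes, ReLU activations, weight initialisation, `gamma` value, and all function names are assumptions for illustration, not details taken from the paper. Training (gradient updates and the replay buffer) is left out.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    # One row per output unit; the last entry of each row is the bias.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def linear(layer: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in layer]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class QNetwork:
    """Three linear layers: state embedding -> hidden -> hidden -> one Q-value per expert."""

    def __init__(self, state_dim: int, hidden: int, n_experts: int, seed: int = 0):
        rng = random.Random(seed)
        self.layers = [
            init_layer(state_dim, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, n_experts, rng),
        ]

    def q_values(self, state: Vector) -> Vector:
        h = relu(linear(self.layers[0], state))
        h = relu(linear(self.layers[1], h))
        return linear(self.layers[2], h)


def select_expert(net: QNetwork, state: Vector, epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy: explore a random expert with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(net.layers[-1]))
    q = net.q_values(state)
    return max(range(len(q)), key=q.__getitem__)


def td_target(net: QNetwork, reward: float, next_state: Vector,
              done: bool, gamma: float = 0.99) -> float:
    """One-step Q-learning target: r if terminal, else r + gamma * max_a Q(s', a)."""
    if done:
        return reward
    return reward + gamma * max(net.q_values(next_state))


if __name__ == "__main__":
    net = QNetwork(state_dim=4, hidden=8, n_experts=3)
    state = [0.1, 0.2, 0.3, 0.4]  # stand-in for a text embedding of the state
    print(select_expert(net, state, epsilon=0.1, rng=random.Random(1)))
```

Because the state here is only a fixed-length vector derived from text, the controller never sees the intermediate image, which is the limitation the third weakness above points at.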
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI
