From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

TL;DR
This paper introduces a new framework for open-ended image editing that combines planning, tool orchestration, and outcome-based rewards to improve coherence and reliability in complex tasks.
Contribution
It presents an experiential, reward-driven approach that tightly couples planning with execution, surpassing prior rule-based and single-step methods.
Findings
Outperforms rule-based multi-step baselines in the coherence of edited images.
Uses outcome-based rewards to refine planning and execution.
Achieves more reliable and visually coherent edits.
Abstract
Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or…
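The loop described in the abstract (plan, orchestrate, judge, keep successful trajectories) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (TOOLS, plan, orchestrate, toy_judge, collect) is a hypothetical placeholder, the planner and orchestrator are trivial stand-ins for learned models, and the judge is a toy rule rather than a vision-language model. It shows only the control flow under those assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical tool names, chosen only for illustration.
TOOLS = ["inpaint", "recolor", "replace_object"]


@dataclass
class Action:
    step: str    # atomic sub-instruction produced by the planner
    tool: str    # tool the orchestrator selected for this step
    region: str  # region the tool would be applied to


def plan(instruction: str) -> List[str]:
    """Placeholder planner: split a ';'-separated instruction into atomic steps."""
    return [s.strip() for s in instruction.split(";") if s.strip()]


def orchestrate(step: str, rng: random.Random) -> Action:
    """Placeholder orchestrator: sample a tool uniformly at random for a step."""
    return Action(step=step, tool=rng.choice(TOOLS), region="full_image")


def toy_judge(instruction: str, actions: List[Action]) -> float:
    """Toy stand-in for a vision-language judge; returns a reward in [0, 1].

    An action counts as a hit when 'recolor' is used exactly for the steps
    that mention 'color'. This rule is an assumption made for the example.
    """
    if not actions:
        return 0.0
    hits = sum(1 for a in actions if ("color" in a.step) == (a.tool == "recolor"))
    return hits / len(actions)


Rollout = Tuple[List[Action], float]


def collect(
    instruction: str,
    judge: Callable[[str, List[Action]], float],
    n_rollouts: int = 8,
    threshold: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Rollout], List[Rollout]]:
    """Run several rollouts, score each, and keep those at or above threshold."""
    rng = random.Random(seed)
    steps = plan(instruction)
    rollouts: List[Rollout] = []
    for _ in range(n_rollouts):
        actions = [orchestrate(s, rng) for s in steps]
        rollouts.append((actions, judge(instruction, actions)))
    # Rollouts that clear the threshold are the candidates for planner refinement.
    successes = [r for r in rollouts if r[1] >= threshold]
    return rollouts, successes


if __name__ == "__main__":
    all_rollouts, good = collect("change the sign color; remove the steak", toy_judge)
    print(f"{len(good)} of {len(all_rollouts)} rollouts cleared the threshold")
```

In a real system the rewards would also drive an update of the orchestrator's policy, and the retained trajectories would be used as training data for the planner; both updates are omitted here because the summary does not specify them.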
