RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

TL;DR
RePlan is a plan-then-execute framework for instruction-based image editing that combines step-by-step reasoning with explicit region grounding, enabling precise multi-region edits in cluttered scenes. It outperforms existing methods on IV-Edit, a new and challenging benchmark.
Contribution
We introduce RePlan, a reasoning-guided region planning framework with a new training-free editing mechanism and a reinforcement learning approach, advancing the state-of-the-art in complex instruction-based image editing.
Findings
RePlan achieves higher regional precision and fidelity than baselines.
GRPO-based reinforcement learning on 1K instruction-only examples improves the planner's reasoning fidelity and format reliability.
RePlan performs well on the new IV-Edit benchmark.
Abstract
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained…
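The attention-region injection described in the abstract can be pictured as a boolean attention mask over the editor's joint text and image tokens. The following is a minimal sketch under stated assumptions, not the authors' implementation: the token layout (a global prompt, then one prompt per region, then image tokens), the function name `build_region_attention_mask`, and the three rules it encodes (region prompts isolated from each other, image tokens tied to their own region's prompt, background tokens tied only to the global prompt) are all illustrative choices.

```python
from typing import List, Sequence, Tuple


def build_region_attention_mask(
    prompt_lengths: Sequence[int],
    token_regions: Sequence[int],
) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token layout (hypothetical): all prompt tokens first, prompt by
    prompt, followed by the image tokens.

    prompt_lengths[0] is the length of a global (background) prompt and
    prompt_lengths[r], r >= 1, is the length of the prompt for region r.
    token_regions[i] is the region id of image token i, with 0 = background.
    """
    num_prompts = len(prompt_lengths)
    if any(r < 0 or r >= num_prompts for r in token_regions):
        raise ValueError("every image token needs a region id with a prompt")

    # Tag each token with its kind and the prompt/region it belongs to.
    owners: List[Tuple[str, int]] = []
    for prompt_id, length in enumerate(prompt_lengths):
        owners.extend([("text", prompt_id)] * length)
    owners.extend(("image", region_id) for region_id in token_regions)

    def allowed(query: Tuple[str, int], key: Tuple[str, int]) -> bool:
        q_kind, q_id = query
        k_kind, k_id = key
        if q_kind == "image" and k_kind == "image":
            # Image tokens keep full access to each other for global context.
            return True
        # Text-to-text: prompts only see themselves (prompt isolation).
        # Text-to-image and image-to-text: a prompt is linked only to the
        # image tokens of its own region; id 0 links the background tokens
        # to the global prompt (region and background constraints).
        return q_id == k_id

    return [[allowed(q, k) for k in owners] for q in owners]


if __name__ == "__main__":
    # Global prompt of 2 tokens, two region prompts of 1 token each,
    # and 4 image tokens: background, region 1, region 2, background.
    mask = build_region_attention_mask([2, 1, 1], [0, 1, 2, 0])
    for row in mask:
        print("".join("1" if cell else "." for cell in row))
```

Because every region is encoded in the same mask, a single denoising pass could in principle serve all regions at once, which is consistent with the abstract's claim of parallel multi-region edits without iterative inpainting.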
Peer Reviews
Decision: Submitted to ICLR 2026
# Originality
- The paper systematically introduces the concept of "Instruction-Visual Complexity (IV-Complexity)," explicitly defining the challenges arising from the interplay between visual complexity (cluttered layouts, multiple similar objects) and instructional complexity (multi-object references, implicit semantics, knowledge reasoning).
- Through attention mask rules (prompt isolation, region constraints, background constraints, etc.), the method achieves precise multi-region parallel ed

- All main results (Table 1) rely on 5-point scoring from a single closed-source model, presenting risks that Gemini may exhibit bias towards its own model (Gemini-Flash-Image) with more lenient scoring, or that results may not be reproducible.
- The paper does not specify the image sources, annotation procedures, or other critical aspects of the dataset construction process.
- Key hyperparameters such as learning rate, batch size, number of training epochs, and GRPO group size are not documen

1. The proposed region-level control is both necessary and interesting.
2. A new benchmark is introduced to evaluate the proposed task.
3. Qualitative results demonstrate clear improvements over previous methods.

1. Figures 9–12 are confusing, as they are not referenced in the text nor sufficiently explained in the captions.
2. I wonder whether using the generated text from the VLM as input to existing MMDiT-style editing models (e.g., Flux-Edit) would still improve performance. This raises the question of whether the effectiveness comes from the comprehensive textual information or from the training-free editing mechanism itself.
3. In Stage 1 (RL), it is unclear how the rewards are computed and what

1. The introduction of “Instruction–Visual Complexity” formalizes an important challenge in multimodal editing and provides clear motivation for a new benchmark.
2. The combination of reasoning-driven planning with region-level attention control is well thought out and technically sound, avoiding costly retraining.
3. The new IV-Edit benchmark and extensive comparisons with both open-source and proprietary models strengthen the empirical validation.

1. Unclear realization of “step-by-step reasoning”: Although the paper claims that the planner decomposes instructions via step-by-step reasoning, neither the framework diagram nor the visualization results clearly illustrate this multi-step reasoning process. The reasoning seems to occur implicitly rather than explicitly, weakening the claimed interpretability advantage.
2. The IV-Edit benchmark (∼800 samples) may not be large enough to comprehensively assess generalization, especially for re
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
