ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu

TL;DR
ReasonEdit introduces reasoning mechanisms like thinking and reflection into image editing models, significantly improving their understanding and accuracy by enabling iterative interpretation and correction of edits.
Contribution
The paper proposes a novel reasoning-enhanced framework for image editing that incorporates thinking and reflection mechanisms, advancing beyond existing frozen MLLM-based systems.
Findings
Achieved +4.3% to +8.2% performance improvements on multiple benchmarks.
Demonstrated effectiveness of reasoning mechanisms in interpreting abstract instructions.
Outperformed previous open-source methods on GEdit and Kris datasets.
Abstract
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended…
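The thinking-editing-reflection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the three callables (`think`, `edit`, `reflect`), the `max_rounds` cap, and the choice to apply each correction to the latest edited image are all assumptions made for illustration, and images are represented as plain strings so the example runs without any model.

```python
from typing import Callable, List, Tuple

Image = str  # hypothetical stand-in; a real system would pass image tensors


def thinking_editing_reflection(
    source: Image,
    instruction: str,
    think: Callable[[Image, str], str],
    edit: Callable[[Image, str], Image],
    reflect: Callable[[Image, Image, str], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> Tuple[Image, List[str]]:
    """Run one thinking step, then alternate editing and reflection."""
    # Thinking: turn an abstract instruction into a concrete edit plan.
    plan = think(source, instruction)
    history = [plan]
    current = edit(source, plan)
    for _ in range(max_rounds):
        # Reflection: review the result and either accept it or propose a fix.
        accepted, correction = reflect(source, current, instruction)
        if accepted:
            break
        history.append(correction)
        # Assumption: the correction is applied to the latest edited image.
        current = edit(current, correction)
    return current, history


# Toy stand-ins for the MLLM reasoner and the diffusion generator.
def toy_think(image: Image, instruction: str) -> str:
    return "add snow" if instruction == "make it winter" else instruction


def toy_edit(image: Image, plan: str) -> Image:
    return f"{image}+{plan}"


def toy_reflect(source: Image, edited: Image, instruction: str) -> Tuple[bool, str]:
    if "remove leaves" in edited:
        return True, ""
    return False, "remove leaves"


result, history = thinking_editing_reflection(
    "photo", "make it winter", toy_think, toy_edit, toy_reflect
)
print(result)   # photo+add snow+remove leaves
print(history)  # ['add snow', 'remove leaves']
```

Capping the loop with `max_rounds` bounds the extra inference cost that each reflection round adds over a single-pass edit.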
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear Motivation. ReasonEdit effectively leverages the reasoning capabilities of MLLM to enhance image editing performance. The authors explore both thinking and reflection modes for an instruction-based editing task. Instead of treating the MLLM as a frozen feature extractor, the authors jointly optimize MLLM with the diffusion decoder based on their reasoning-enhanced dataset, improving the performance under abstract instructions.
2. Well-designed data curation and training strategy. The pr
1. The "reflection" mechanism is designed as an "iterative self-correction and optimization" process. This iterative process inevitably increases inference latency and computational overhead, making it slower than single-pass editing models. However, the paper does not report inference speed, computational cost, or the average number of reflection rounds required for a successful edit. This is a significant limitation for the practical application of the model.
2. Although
- The introduction of explicit, modular “thinking” (instruction grounding/decomposition) and “reflection” (iterative self-correction) mechanisms directly within an image editing pipeline is well-motivated and appropriately positioned with respect to recent advances in reasoning for multimodal models.
- The work details a robust data pipeline (including the construction of both “Thinking Pairs” and “Reflection Triples”) to support supervised training for both the reasoning and editing aspects. Th
- While the paper is methodologically sound, it does not provide deeper theoretical or formal justification for why incorporating reasoning delivers better generalization or robustness beyond anecdotal/empirical evidence. There is no formal analysis of potential failure modes, e.g., when thinking/reflection might increase hallucination or overfit to annotation artifacts.
- The work focuses solely on the image editing scenario and largely benchmarks against datasets that were partially constructe
- Well-motivated and clearly structured framework: The motivation, to address the limited reasoning capability of existing MLLM-frozen editing pipelines, is clearly articulated. The decomposition into a Reasoner (MLLM) and Generator (DiT), coupled through thinking and reflection cycles, provides a coherent design that is easy to follow and conceptually elegant.
- Innovative data construction tailored for reasoning-aware editing: The paper goes beyond standard instruction-image datasets by introduc
- Dataset generation and reproducibility insufficiently detailed: The Thinking and Reflection datasets rely heavily on automated labeling with advanced MLLMs, but the paper does not disclose which annotators or models were used, nor any quality control metrics (agreement rate, filtering thresholds, or bias mitigation). Without these details, reproducibility and data reliability remain unclear.
- Evaluation over-relies on GPT-4/4o automatic scoring: Most benchmarks use GPT-based metrics (VIEScore
1. Overall, this paper is well organized and easy to read. The figures illustrate the idea, especially the thinking–editing–reflection loop, very clearly.
2. The experiment section is extensive. It conducts experiments on mainstream benchmarks and provides comparisons against a sufficient number of baseline methods, so I find the experimental results convincing.
3. The idea is simple yet reasonable; it makes the image-editing system more consistent with human behavior. I can understand that this RL-like
1. The idea is reasonable, but the novelty is only moderate. I am not asking the authors to pursue novelty for its own sake, but I would be happy to see more discussion of the novelty or of other interesting aspects of the model design.
2. The authors could discuss the limitations of the proposed method, including some failure cases. I think this would be beneficial to the community.
3. I would also like to see some discussion of the training cost and the stability of the model.
4. I think the introduction part of the p
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship
