InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models
Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang, Zenghui Xiong, Yifan Ding, Aoxiang Ping, Xiang Li, Tong Guo, Yao Mao

TL;DR
InEdit-Bench is a new benchmark designed to evaluate the ability of image editing models to reason over and generate coherent intermediate logical pathways during complex visual manipulations, addressing a key limitation in current models.
Contribution
This paper introduces InEdit-Bench, the first benchmark for assessing reasoning over intermediate pathways in image editing, including evaluation criteria and comprehensive testing of existing models.
Findings
Current models show significant shortcomings in reasoning over intermediate pathways.
InEdit-Bench reveals widespread deficiencies in dynamic, multi-step image editing capabilities.
The benchmark encourages development of more reasoning-aware, intelligent image editing models.
Abstract
Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Interesting problem: evaluation beyond single-shot edits. Attempt to structure multi-step edits with sub-tasks.
I have several issues with this paper and am not sure where to start, but here is my attempt. **Major Conceptual Issues** 1. Fundamental Mischaracterization of the Task. My biggest concern is that the paper conflates visual interpolation/transition generation with reasoning. Generating intermediate frames between two images is primarily an interpolation or perhaps an animation task, not a reasoning task. The claim that this measures "procedural reasoning" and "causal understanding" is e
- Well-motivated task: moves evaluation from “final image only” to “full editing trajectory,” which is missing in current benchmarks. - Process-oriented metrics: adding logical/scientific plausibility on top of standard vision metrics is a concrete contribution.
- Small scale (237 cases) for a benchmark that aims to compare many models; unclear robustness. - Single judge dependency: relies heavily on one LMM evaluator; no human–LMM agreement or judge ablations.
* Propose a novel benchmark for image editing with new data and new evaluation methodology. * Propose a new task of evaluating generation quality along editing path.
* This benchmark relies heavily on AI for both dataset construction and evaluation metrics. While this can be seen as a strength (automation), the authors should validate the AI components. First, the generated images sometimes seem to be of poor quality. For example, I am concerned about the soundness of the "Science" part of the benchmark. Moreover, the evaluation methodology also relies on VLMs. This raises questions about the robustness of the VLM used and its correlation with human perception. How st
- **Novel evaluation perspective:** Focuses on the reasoning process of image editing rather than only final output quality — a relevant and underexplored dimension. - **Dataset diversity:** Includes varied categories (scientific, temporal, dynamic), providing a potentially useful testbed for assessing sequential or reasoning-based editing. - **Broad model coverage:** Evaluates many prominent image editing models, both open-source and proprietary.
1. **Ad-hoc task taxonomy.** The division into four categories and sixteen sub-tasks lacks principled justification. Definitions between task types (e.g., “temporal sequence” vs “dynamic process”) are blurry. 2. **Unvalidated evaluation method.** The three novel metrics depend entirely on GPT-4o prompts with no human validation or correlation study. Human evaluation is essential to confirm alignment of VLM scores with human judgment. 3. **Undefined “accuracy.”** Table 1 reports “Accuracy” wi
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
