InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Zhiqiang Sheng; Xumeng Han; Zhiwei Zhang; Zenghui Xiong; Yifan Ding; Aoxiang Ping; Xiang Li; Tong Guo; Yao Mao

arXiv:2603.03657 · cs.CV · March 5, 2026

TL;DR

InEdit-Bench is a new benchmark designed to evaluate the ability of image editing models to reason over and generate coherent intermediate logical pathways during complex visual manipulations, addressing a key limitation in current models.

Contribution

This paper introduces InEdit-Bench, the first benchmark for assessing reasoning over intermediate pathways in image editing, including evaluation criteria and comprehensive testing of existing models.

Findings

01

Current models show significant shortcomings in reasoning over intermediate pathways.

02

InEdit-Bench reveals widespread deficiencies in dynamic, multi-step image editing capabilities.

03

The benchmark encourages development of more reasoning-aware, intelligent image editing models.

Abstract

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Interesting problem: evaluation beyond single-shot edits. Attempt to structure multi-step edits with sub-tasks.

Weaknesses

I have several issues with this paper, and I am not sure where to start. But here is my attempt.

**Major Conceptual Issues**

1. Fundamental Mischaracterization of the Task. My biggest concern is that the paper conflates visual interpolation/transition generation with reasoning. Generating intermediate frames between two images is primarily an interpolation or perhaps an animation task, not a reasoning task. The claim that this measures "procedural reasoning" and "causal understanding" is e

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Well-motivated task: moves evaluation from “final image only” to “full editing trajectory,” which is missing in current benchmarks.
- Process-oriented metrics: adding logical/scientific plausibility on top of standard vision metrics is a concrete contribution.

Weaknesses

- Small scale (237 cases) for a benchmark that aims to compare many models; unclear robustness.
- Single judge dependency: relies heavily on one LMM evaluator; no human–LMM agreement or judge ablations.
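
The human-LMM agreement check this reviewer asks for could be sketched as a rank correlation between the judge's scores and human ratings on the same cases. The snippet below is only a minimal illustration, not the paper's protocol: the function names and the score lists are hypothetical, and Spearman's rho is an assumed choice of agreement statistic.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


# Hypothetical per-case scores (illustrative only, not benchmark data).
judge = [4, 2, 5, 3, 1, 4]
human = [5, 2, 4, 3, 1, 3]
print(f"Spearman rho between judge and human scores: {spearman(judge, human):.3f}")
```

A high rho on a human-rated subset would suggest the judge orders cases similarly to human raters; it would not by itself show that the absolute scores are calibrated.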

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* Proposes a novel benchmark for image editing with new data and a new evaluation methodology.
* Proposes a new task of evaluating generation quality along the editing path.

Weaknesses

* This benchmark relies heavily on AI for both dataset construction and evaluation metrics. While this can be seen as a strength (automation), the authors should validate the AI components. First, the generated images sometimes seem to be of poor quality. For example, I am concerned about the soundness of the "Science" part of the benchmark. Moreover, the evaluation methodology also relies on VLMs. This raises questions about the robustness of the VLM used and its correlation with human perception. How st

Reviewer 04 · Rating 4 · Confidence 3

Strengths

- **Novel evaluation perspective:** Focuses on the reasoning process of image editing rather than only final output quality, a relevant and underexplored dimension.
- **Dataset diversity:** Includes varied categories (scientific, temporal, dynamic), providing a potentially useful testbed for assessing sequential or reasoning-based editing.
- **Broad model coverage:** Evaluates many prominent image editing models, both open-source and proprietary.

Weaknesses

1. **Ad-hoc task taxonomy.** The division into four categories and sixteen sub-tasks lacks principled justification. Definitions between task types (e.g., “temporal sequence” vs “dynamic process”) are blurry.
2. **Unvalidated evaluation method.** The three novel metrics depend entirely on GPT-4o prompts with no human validation or correlation study. Human evaluation is essential to confirm alignment of VLM scores with human judgment.
3. **Undefined “accuracy.”** Table 1 reports “Accuracy” wi

Code & Models

Datasets

SZStrong/InEdit-Bench
dataset· 502 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection