CompBench: Benchmarking Complex Instruction-guided Image Editing
Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Yao Hu, Zihan Wang, Yuan Xie, Shaohui Lin

TL;DR
CompBench is a new large-scale benchmark designed to evaluate complex, instruction-guided image editing models, emphasizing fine-grained instructions, spatial reasoning, and diverse editing scenarios to identify current limitations.
Contribution
We introduce CompBench, a comprehensive benchmark with a novel instruction decoupling strategy and collaborative framework to better evaluate complex image editing tasks.
Findings
Current models show significant limitations on complex editing tasks.
CompBench reveals the need for advanced reasoning capabilities in image editing models.
The benchmark provides a foundation for developing more capable instruction-guided editing systems.
Abstract
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between…
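As a rough illustration of the instruction decoupling idea, the four dimensions named in the abstract (location, appearance, dynamics, objects) can be pictured as separate slots of one record. The sketch below is an assumption made for illustration only: the class name `DecoupledInstruction`, its helper methods, and the example instruction are hypothetical and are not taken from the CompBench paper or its released data format.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four dimensions named in the abstract.
DIMENSIONS = ("location", "appearance", "dynamics", "objects")


@dataclass
class DecoupledInstruction:
    """Hypothetical container: one optional free-text slot per dimension."""
    location: Optional[str] = None
    appearance: Optional[str] = None
    dynamics: Optional[str] = None
    objects: Optional[str] = None

    def specified(self) -> List[str]:
        """Names of the dimensions this instruction actually constrains."""
        return [d for d in DIMENSIONS if getattr(self, d)]

    def missing(self) -> List[str]:
        """Names of the dimensions left unspecified."""
        return [d for d in DIMENSIONS if not getattr(self, d)]


# Made-up example instruction, not a sample from the benchmark.
example = DecoupledInstruction(
    objects="a dog",
    appearance="brown fur, red collar",
    location="on the left side of the bench",
)
print(example.specified())  # ['location', 'appearance', 'objects']
print(example.missing())    # ['dynamics']
```

Keeping each dimension in its own slot makes it easy to check which aspects of an edit an instruction pins down and which it leaves open.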
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. CompBench covers nine diverse editing tasks, ranging from local and multi-object edits to action, viewpoint, and implicit reasoning tasks. 2. The data pipeline leverages both MLLMs and repeated expert human verification.
1. The positioning with respect to the latest complexity-aware or dynamic-editing benchmarks is incomplete. For example, ByteMorph [1] and ComplexBench-Edit [2] both focus on complexity-controllable/dynamic edits. 2. The work claims to cover 'action editing' and dynamic scene manipulations, yet the quantitative detail and qualitative discussion are not as deep or exhaustive as for local/multi-object cases. The action/location/viewpoint results are summarized in only a single tabl
The paper presents a dataset with a rich variety of categories and provides preliminary experiments demonstrating this diversity, confirming certain limitations of existing instruction-guided image editing models.
1. Limited dataset scale and utility: The dataset contains only ~3K samples, far short of the claimed large scale, and is better characterized as a test set than as a full dataset. Its size and diversity are insufficient to support training new models or demonstrate substantial improvements. 2. Lack of methodological contribution: The work does not propose any new models or preliminary solutions leveraging the dataset. Merely highlighting the limitations of existing models without offering meth
+ The paper addresses a gap in existing benchmarks by focusing on complex, realistic editing scenarios that better reflect real-world applications. The instruction decomposition strategy along four dimensions (location, appearance, dynamics, objects) is well-motivated. The MLLM-human collaborative framework for data construction represents a creative approach to ensuring high-quality annotations. + The benchmark construction methodology is rigorous, involving multiple stages of quality control
+ While 3,000+ samples represent a substantial effort, the benchmark is relatively small compared to some existing datasets (e.g., UltraEdit with 4M samples). The reliance on the MOSE video dataset as the primary source may introduce domain bias, potentially limiting the diversity of visual scenes and contexts. + The heavy reliance on CLIP-based metrics for foreground evaluation may inherit known limitations of CLIP in understanding fine-grained visual details and spatial relationships. The use of G
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
