GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan

TL;DR
GIE-Bench introduces a grounded, multi-dimensional benchmark for evaluating text-guided image editing models, focusing on functional correctness and content preservation, with over 1000 examples and automatic metrics validated against human ratings.
Contribution
The paper presents a novel benchmark with automatic evaluation metrics for assessing the accuracy and content preservation of text-guided image editing models.
Findings
GPT-Image-1 achieves high instruction-following accuracy
Current models tend to over-modify irrelevant regions
GIE-Bench enables scalable, reproducible evaluation
Abstract
Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each…
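The preservation dimension described above can be illustrated with a small sketch. It assumes the edited object is already available as a boolean mask (in the benchmark this comes from an object-aware masking step) and scores only the pixels outside that mask. The function name `preservation_scores` and the choice of masked MSE and PSNR are illustrative assumptions, not the benchmark's exact scoring.

```python
import numpy as np

def preservation_scores(original, edited, edit_mask, max_value=255.0):
    """Score how well pixels OUTSIDE the edited region were preserved.

    original, edited: arrays of shape (H, W) or (H, W, C) with the same shape.
    edit_mask: boolean array of shape (H, W), True where the edit was intended.
    Hypothetical helper; masked MSE/PSNR are stand-ins for the paper's metrics.
    """
    original = np.asarray(original, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    keep = ~np.asarray(edit_mask, dtype=bool)  # invert: regions that should not change

    if original.shape != edited.shape:
        raise ValueError("original and edited must have the same shape")
    if keep.shape != original.shape[:2]:
        raise ValueError("edit_mask must match the image height and width")
    if not keep.any():
        raise ValueError("edit_mask covers the whole image; nothing to preserve")

    diff = (original - edited)[keep]  # differences over preserved pixels only
    mse = float(np.mean(diff ** 2))
    psnr = float("inf") if mse == 0 else float(10 * np.log10(max_value ** 2 / mse))
    return {"masked_mse": mse, "masked_psnr": psnr}


# Usage: a change inside the mask is not penalized, a change outside it is.
original = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True

clean_edit = original.copy()
clean_edit[:2, :2] = 255              # intended edit only
clean = preservation_scores(original, clean_edit, mask)

leaky_edit = clean_edit.copy()
leaky_edit[3, 3] = 10                 # unintended change outside the mask
leaky = preservation_scores(original, leaky_edit, mask)
```

Because the masked region is excluded, a model is not penalized for making the requested change; only collateral modifications lower the score.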
Peer Reviews
Decision · Submitted to ICLR 2026
1. It's not just about whether the edit happened, but also about what didn't happen. By separating "functional correctness" from "content preservation," the benchmark gives a much more complete picture of a model's performance.
2. Using object masks to evaluate only the unedited parts of an image is a brilliant move. It stops penalizing a model for making the correct change and focuses squarely on unintended collateral damage.
3. Fully Automated & Scalable: The entire pipeline, from generating q
1. Only handles one-shot edits. The benchmark is designed for simple, single-step instructions ("change the car to red"). It can't evaluate more complex, real-world scenarios where a user might give a series of commands or have a back-and-forth conversation to refine an image.
2. The introduction of a target mask is of limited importance to the development of current image editing benchmarks.
- Clear, two-axis evaluation that disentangles *did the edit happen?* from *what collateral damage occurred?*, a practical and under-measured trade-off in editing.
- Object-aware preservation via GroundingDINO→SAM masks is a concrete improvement over global CLIP/LPIPS that confound edits with preservation.
- Operational details (geometric alignment before pixel metrics; per-edit-type breakdown; deterministic judging; a second judge) increase reproducibility and confidence.
- Breadth of cove
- MCQs and correctness judgments rely on frontier VLMs (GPT-4o/Gemini). This raises *construct validity* and *reproducibility* questions (model updates, access, and potential judge–system coupling). Publishing non-proprietary baselines (e.g., open-weights VLMs) would help.
- Preservation hinges on the *inverted* object mask. Small mask errors (under/over-segmentation, ambiguous targets like “sky near horizon”) can mis-score preservation. Quantifying mask quality and its effect (e.g., via pertu
+ The work proposes to address an admittedly existing gap in the current image editing benchmarks. The protocols for assessing text-guided editing are generally global and fail to disentangle correctness from preservation. The proposal seems timely.
+ The automatic pipeline involving GPT, GroundingDINO, SAM and masked metrics is well-engineered.
+ The usage of multiple-choice VQA evaluation is more robust than binary yes/no formats (e.g., I2E-Bench), reducing chance accuracy and allowing large-s
- Although the introduction of QA-based functional correctness is interesting, the proposed benchmark, if I understand correctly, focuses primarily on single-turn image editing. Multi-step, compositional, or iterative editing scenarios are missing, which limits real-world applicability.
- The scale of human evaluation seems limited. The human study in Sec. 4.3 uses 100 examples with 4 annotators, which is small relative to the 1k+ samples in the full benchmark.
- The QA stage is heavily based
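The reviewers' remark that multiple-choice questions lower chance accuracy relative to yes/no formats can be made concrete with a short sketch. The helper names `chance_accuracy` and `functional_correctness` are hypothetical, and the four-option question format in the usage lines is an assumption for illustration; the number of options used by the benchmark is not stated in this excerpt.

```python
from fractions import Fraction

def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of a uniformly random guesser on one question."""
    if num_options < 2:
        raise ValueError("a question needs at least two options")
    return Fraction(1, num_options)

def functional_correctness(predicted, answer_key) -> float:
    """Fraction of multiple-choice questions answered correctly by the judge."""
    if not answer_key or len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must be non-empty and equal length")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


# Usage: a yes/no question vs. an (assumed) four-option question.
binary_chance = chance_accuracy(2)   # Fraction(1, 2)
mcq_chance = chance_accuracy(4)      # Fraction(1, 4)

# Judge answers for four edits compared against the answer key.
score = functional_correctness(["A", "C", "B", "D"], ["A", "B", "B", "D"])
```

With more options per question, a score well above the chance floor is stronger evidence that the intended edit was actually applied.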
Taxonomy
Topics: Digital Humanities and Scholarship · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
