GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Yusu Qian; Jiasen Lu; Tsu-Jui Fu; Xinze Wang; Chen Chen; Yinfei Yang; Wenze Hu; Zhe Gan

arXiv:2505.11493·cs.CV·July 28, 2025

TL;DR

GIE-Bench introduces a grounded, multi-dimensional benchmark for evaluating text-guided image editing models, focusing on functional correctness and content preservation, with over 1000 examples and automatic metrics validated against human ratings.

Contribution

The paper presents a novel benchmark with automatic evaluation metrics for assessing the accuracy and content preservation of text-guided image editing models.

Findings

01 GPT-Image-1 achieves high instruction-following accuracy

02 Current models tend to over-modify irrelevant regions

03 GIE-Bench enables scalable, reproducible evaluation

Abstract

Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each…
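The two dimensions described in the abstract can be illustrated with a minimal sketch. It is not the benchmark's implementation: the function names `mcq_accuracy` and `masked_mse` are hypothetical, mean squared error is used only as a stand-in for whatever preservation scores the benchmark computes, and images are simplified to nested lists of grayscale values. The edit mask is assumed to be `True` on the targeted object, so preservation is measured only where the mask is `False`.

```python
from typing import List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values (assumption)
Mask = List[List[bool]]    # True where the edit is expected to change pixels


def mcq_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of multiple-choice answers that match the expected option.

    Hypothetical helper: compares option letters case-insensitively.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == e.strip().upper()
        for p, e in zip(predicted, expected)
    )
    return correct / len(expected)


def masked_mse(original: Image, edited: Image, edit_mask: Mask) -> float:
    """Mean squared error over pixels OUTSIDE the edit mask.

    Lower values mean the non-targeted regions were better preserved.
    MSE is an illustrative stand-in, not the benchmark's actual metric.
    """
    total = 0.0
    count = 0
    for row_o, row_e, row_m in zip(original, edited, edit_mask):
        for o, e, is_target in zip(row_o, row_e, row_m):
            if not is_target:  # score only the region that should stay unchanged
                total += (o - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; nothing to score")
    return total / count


if __name__ == "__main__":
    original = [[0.0, 0.0], [0.0, 0.0]]
    edited = [[1.0, 0.0], [0.0, 0.5]]        # top-left is the intended edit
    mask = [[True, False], [False, False]]   # only top-left is targeted
    print(mcq_accuracy(["A", "b", "C"], ["A", "B", "D"]))
    print(masked_mse(original, edited, mask))
```

In the toy example, an unmasked MSE over all four pixels would be 0.3125 because it also counts the intended change at the top-left pixel. The masked version scores only the three untargeted pixels and reports about 0.083, which isolates the unintended change at the bottom-right pixel.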

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. It's not just about whether the edit happened, but also about what didn't happen. By separating "functional correctness" from "content preservation," the benchmark gives a much more complete picture of a model's performance.
2. Using object masks to evaluate only the unedited parts of an image is a brilliant move. It stops penalizing a model for making the correct change and focuses squarely on unintended collateral damage.
3. Fully Automated & Scalable: The entire pipeline, from generating q

Weaknesses

1. Only handles one-shot edits. The benchmark is designed for simple, single-step instructions ("change the car to red"). It can't evaluate more complex, real-world scenarios where a user might give a series of commands or have a back-and-forth conversation to refine an image.
2. The introduction of a target mask is of limited importance to the development of current image editing benchmarks.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Clear, two-axis evaluation that disentangles *did the edit happen?* from *what collateral damage occurred?*, a practical and under-measured trade-off in editing.
- Object-aware preservation via GroundingDINO→SAM masks is a concrete improvement over global CLIP/LPIPS that confound edits with preservation.
- Operational details (geometric alignment before pixel metrics; per-edit-type breakdown; deterministic judging; a second judge) increase reproducibility and confidence.
- Breadth of cove

Weaknesses

- MCQs and correctness judgments rely on frontier VLMs (GPT-4o/Gemini). This raises *construct validity* and *reproducibility* questions (model updates, access, and potential judge–system coupling). Publishing non-proprietary baselines (e.g., open-weights VLMs) would help.
- Preservation hinges on the *inverted* object mask. Small mask errors (under/over-segmentation, ambiguous targets like “sky near horizon”) can mis-score preservation. Quantifying mask quality and its effect (e.g., via pertu

Reviewer 03 · Rating 6 · Confidence 2

Strengths

+ The work proposes to address an admittedly existing gap in the current image editing benchmarks. The protocols for assessing text-guided editing are generally global and fail to disentangle correctness from preservation. The proposal seems timely.
+ The automatic pipeline involving GPT, GroundingDINO, SAM and masked metrics is well-engineered.
+ The usage of multiple-choice VQA evaluation is more robust than binary yes/no formats (e.g., I2E-Bench), reducing chance accuracy and allowing large-s

Weaknesses

- Although the introduction of QA-based functional correctness is interesting, the proposed benchmark, if I understand correctly, focuses primarily on single-turn image editing. Multi-step, compositional, or iterative editing scenarios are missing, which limits the real-world applicability.
- The scale of human evaluation seems limited. The human study in Sec. 4.3 uses 100 examples with 4 annotators, which is small relative to the 1k+ samples in the full benchmark.
- The QA stage is heavily based

Code & Models

Repositories

apple/ml-gie-bench
PyTorch · Official

Taxonomy

Topics: Digital Humanities and Scholarship · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications