Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach

Shiqi Gao; Zitong Xu; Kang Fu; Huiyu Duan; Xiongkuo Min; Jia Wang

arXiv:2603.19775·cs.CV·March 26, 2026

TL;DR

This paper introduces TIEdit, a comprehensive benchmark for evaluating text-guided image editing, and EditProbe, an LLM-based intermediate-layer probing method that better aligns automatic evaluations with human perceptual judgments.

Contribution

The work presents a large-scale benchmark for systematic evaluation of text-guided image editing (TIE) methods and proposes an LLM-based evaluator that uses intermediate-layer probing to improve assessment accuracy.

Findings

01 Automatic metrics show limited correlation with human judgments.

02 EditProbe significantly outperforms existing automatic evaluation methods.

03 TIEdit provides a diverse, expert-annotated dataset for TIE evaluation.

Abstract

Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which are aggregated into 15,360 mean opinion scores (MOSs) across three evaluation dimensions:…
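
The dataset counts quoted in the abstract are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is illustrative only: the variable names are hypothetical, and it assumes that every expert rates every edited image on every dimension, which the abstract does not state outright but which the reported totals imply.

```python
# Figures taken from the abstract.
num_source_images = 512
num_models = 10
num_experts = 20
num_dimensions = 3

# Assumption: each model edits every source image once.
edited_images = num_source_images * num_models

# One mean opinion score (MOS) per edited image per dimension.
mos_count = edited_images * num_dimensions

# Assumption: every expert rates every image on every dimension.
raw_ratings = mos_count * num_experts

print(edited_images, mos_count, raw_ratings)  # 5120 15360 307200
```

Under these assumptions each MOS averages 20 individual ratings.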

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Multimodal Machine Learning Applications