Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach
Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min, Jia Wang

TL;DR
This paper introduces TIEdit, a comprehensive benchmark for evaluating text-guided image editing, and EditProbe, an LLM-based intermediate-layer probing method that better aligns automatic evaluations with human perceptual judgments.
Contribution
The work presents a large-scale benchmark for systematic evaluation of text-guided image editing (TIE) methods and proposes a novel LLM-based evaluator that uses intermediate-layer probing to improve assessment accuracy.
Findings
Existing automatic metrics show limited correlation with human judgments.
EditProbe significantly outperforms existing automatic evaluation methods.
TIEdit provides a diverse, expert-annotated dataset for TIE evaluation.
Abstract
Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts were recruited, producing 307,200 raw subjective ratings, which are aggregated into 15,360 mean opinion scores (MOSs) across three evaluation dimensions:…
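The counts in the abstract are mutually consistent, which the following minimal Python sketch checks. It assumes that every expert rates every edited image on every dimension, and that a MOS is the plain mean of the ratings for one image on one dimension. The abstract does not state the aggregation procedure, so the constant names and the `mean_opinion_score` helper are illustrative assumptions rather than the authors' pipeline.

```python
from statistics import mean

# Counts stated in the abstract.
NUM_SOURCE_IMAGES = 512
NUM_MODELS = 10
NUM_EXPERTS = 20
NUM_DIMENSIONS = 3  # three evaluation dimensions

# Each source image is edited once by each model.
edited_images = NUM_SOURCE_IMAGES * NUM_MODELS
# One MOS per edited image per dimension.
mos_count = edited_images * NUM_DIMENSIONS
# Assumption: every expert rates every image on every dimension.
raw_ratings = mos_count * NUM_EXPERTS


def mean_opinion_score(ratings):
    """Aggregate one image's ratings on one dimension.

    Assumption: a plain arithmetic mean; the actual benchmark may
    normalize or screen raters before averaging.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings)


if __name__ == "__main__":
    print(edited_images, mos_count, raw_ratings)
```

Under these assumptions the products come to 5,120 edited images, 15,360 MOSs, and 307,200 raw ratings, matching the figures quoted above.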
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Multimodal Machine Learning Applications
