LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs

Zitong Xu; Huiyu Duan; Bingnan Liu; Guangji Ma; Jiarui Wang; Liu Yang; Shiqi Gao; Xiaoyu Wang; Jia Wang; Xiongkuo Min; Guangtao Zhai; Weisi Lin

arXiv:2507.16193 · cs.CV · September 9, 2025

TL;DR

This paper introduces EBench-18K, a large-scale benchmark for evaluating Text-guided Image Editing (TIE), and proposes LMM4Edit, an LMM-based metric that aligns well with human preferences across multiple aspects of editing quality.

Contribution

The paper presents the first large-scale benchmark for TIE evaluation and introduces LMM4Edit, a novel multimodal metric that comprehensively assesses image editing quality and alignment.

Findings

01. LMM4Edit outperforms existing metrics in correlating with human preferences.

02. EBench-18K provides extensive data for evaluating TIE models across multiple dimensions.

03. LMM4Edit demonstrates strong generalization in zero-shot evaluations.
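
Agreement between an automatic metric and human preferences is commonly quantified with a rank correlation between the metric's scores and the mean opinion scores (MOSs). The sketch below computes a Spearman rank correlation in plain Python. This summary does not say which statistic the authors report, so the choice of Spearman, the function names, and the toy scores are all assumptions made for illustration, not the paper's evaluation code.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(metric_scores, mos):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(mos))

# Hypothetical toy data: metric outputs and human MOSs for five edited images.
toy_metric = [0.2, 0.5, 0.4, 0.9, 0.7]
toy_mos = [1.5, 3.0, 2.5, 4.8, 4.0]
toy_srcc = spearman(toy_metric, toy_mos)
```

A value near 1 means the metric orders the edited images the same way human raters do, which is the sense in which one metric can be said to correlate better with human preferences than another.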

Abstract

The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics are limited in scale or in their alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) collected along three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to…
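
The rounded counts in the abstract are mutually consistent under a simple reading, checked in the short sketch below. It assumes that every one of the 17 models edits every one of the 1,080 source images and that each edited image receives one MOS per evaluation dimension. That full cross product is an assumption of this sketch, not something the abstract states, and the variable names are illustrative only.

```python
# Counts stated in the abstract.
SOURCE_IMAGES = 1080
TIE_MODELS = 17
EVAL_DIMENSIONS = 3

# Assumption: every model edits every source image (full cross product).
edited_images = SOURCE_IMAGES * TIE_MODELS

# Assumption: one MOS per edited image per evaluation dimension.
mos_count = edited_images * EVAL_DIMENSIONS

print(edited_images)  # 18360, consistent with "18K+ edited images"
print(mos_count)      # 55080, consistent with "55K+ MOSs"
```

Under the same reading, one QA pair per edited image would also match the "18K+" QA pairs figure.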

Taxonomy

Topics: Multimodal Machine Learning Applications