Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Runzhou Liu (1); Hailey Weingord (2); Sejal Mittal (2); Prakhar Dungarwal (2); Anusha Nandula (2); Bo Ni (3); Samyadeep Basu (4); Hongjie Chen (5); Nesreen K. Ahmed (6); Li Li (7); Jiayi Zhang (8); Koustava Goswami (4); Subhojyoti Mukherjee (4); Branislav Kveton (4); Puneet Mathur (4); Franck Dernoncourt (4); Yue Zhao (7); Yu Wang (9); Ryan A. Rossi (4); Zhengzhong Tu (10); Hongru Du (1) ((1) University of Virginia; (2) Columbia University; (3) Vanderbilt University; (4) Adobe Research; (5) Dolby Laboratories; (6) Cisco Research; (7) University of Southern California; (8) University of Wisconsin-Madison; (9) University of Oregon; (10) Texas A&M University)

arXiv:2602.13028·cs.CV·February 16, 2026

Open Access

TL;DR

This paper introduces a fine-grained MLLM-based evaluation framework for image editing that aligns closely with human judgments, addressing limitations of traditional metrics by providing detailed, interpretable assessments.

Contribution

The paper proposes an MLLM-as-a-Judge framework built on twelve interpretable factors, a human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics, and empirical evidence of the framework's effectiveness.

Findings

01. MLLM judges closely align with human evaluations
02. Traditional metrics often fail to capture fine-grained editing quality
03. The proposed framework provides more intuitive and reliable assessments

Abstract

Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the…
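As a rough illustration of the decomposition described in the abstract, the sketch below groups per-factor scores under the three dimensions it names (image preservation, edit quality, instruction fidelity) and compares a judge's group means with human ones. This is a minimal sketch under stated assumptions, not the paper's implementation: the factor names, the 1-5 rating scale, the `FactorScore` type, and the mean-absolute-gap comparison are all hypothetical, since the source does not list the twelve factors or specify how agreement is measured.

```python
from dataclasses import dataclass
from statistics import mean

# The three groups come from the abstract. Everything else here
# (factor names, 1-5 scale, the gap measure) is a hypothetical placeholder.
GROUPS = ("image_preservation", "edit_quality", "instruction_fidelity")


@dataclass(frozen=True)
class FactorScore:
    group: str    # one of GROUPS
    factor: str   # hypothetical factor name, not taken from the paper
    score: float  # assumed 1-5 rating


def group_means(scores):
    """Average the factor scores within each group that has any scores."""
    by_group = {g: [] for g in GROUPS}
    for s in scores:
        by_group[s.group].append(s.score)
    return {g: mean(v) for g, v in by_group.items() if v}


def mean_abs_gap(judge, human):
    """Mean absolute difference between judge and human group means,
    computed over the groups both sides have scored."""
    shared = judge.keys() & human.keys()
    return mean(abs(judge[g] - human[g]) for g in shared)


# Illustrative, made-up ratings for a single edited image.
judge_scores = [
    FactorScore("image_preservation", "background_unchanged", 4.0),
    FactorScore("image_preservation", "identity_kept", 5.0),
    FactorScore("edit_quality", "realism", 3.0),
    FactorScore("instruction_fidelity", "follows_prompt", 4.0),
]
human_means = {
    "image_preservation": 4.0,
    "edit_quality": 3.0,
    "instruction_fidelity": 5.0,
}

judge_means = group_means(judge_scores)
print(judge_means)
print(mean_abs_gap(judge_means, human_means))
```

Keeping scores at the factor level and aggregating only at the end preserves the interpretability the abstract emphasizes: a low group mean can be traced back to the individual factor that caused it. A rank correlation over many images would be a more typical agreement measure; the simple gap is used here only to keep the sketch short.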

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Digital Media Forensic Detection