What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Lorenzo Baraldi; Davide Bucciarelli; Federico Betti; Marcella Cornia; Lorenzo Baraldi; Nicu Sebe; Rita Cucchiara

arXiv:2505.20405 · cs.CV · May 28, 2025

TL;DR

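This paper introduces DICE, a multimodal model that detects localized differences between an original image and its edited version and assesses whether each difference is relevant to the editing instruction. The goal is an evaluation of instruction-guided edits that aligns better with human judgment and is more explainable than existing metrics.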

Contribution

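The paper presents DICE (DIfference Coherence Estimator), a model for instruction-guided image edits that combines a difference detector with a coherence estimator. Both components are built on an autoregressive Multimodal Large Language Model (MLLM) and trained with self-supervision, distillation from inpainting networks, and full supervision.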
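The two-component design can be pictured as a small pipeline: one stage lists the differences between the two images, and the other judges each difference against the instruction. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `Difference`, `evaluate_edit`, and `coherent_fraction`, the bounding-box representation, the toy stand-in functions, and the idea of summarizing the result as a fraction are all hypothetical; in DICE both stages are MLLM-based.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data model: a localized difference with a text description.
@dataclass
class Difference:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1); an assumed representation
    description: str

@dataclass
class JudgedDifference:
    difference: Difference
    coherent: bool  # True if the difference is relevant to the instruction

# Assumed interfaces for the two stages (real ones would wrap an MLLM).
Detector = Callable[[object, object], List[Difference]]
Estimator = Callable[[str, Difference], bool]

def evaluate_edit(original: object, edited: object, instruction: str,
                  detect: Detector, judge: Estimator) -> List[JudgedDifference]:
    """Stage 1: detect differences. Stage 2: judge each against the instruction."""
    differences = detect(original, edited)
    return [JudgedDifference(d, judge(instruction, d)) for d in differences]

def coherent_fraction(judged: List[JudgedDifference]) -> float:
    """Assumed summary: share of detected differences judged coherent."""
    if not judged:
        return 0.0
    return sum(j.coherent for j in judged) / len(judged)

# Toy stand-ins so the sketch runs without any model.
STOPWORDS = {"the", "a", "to", "from", "of"}

def toy_detect(original: object, edited: object) -> List[Difference]:
    # Pretends two regions changed; a real detector would inspect the images.
    return [
        Difference((10, 20, 110, 90), "car color changed to red"),
        Difference((200, 5, 260, 120), "tree removed from background"),
    ]

def toy_judge(instruction: str, difference: Difference) -> bool:
    # Keyword overlap as a crude proxy for a learned coherence judgment.
    wanted = set(instruction.lower().split()) - STOPWORDS
    found = set(difference.description.lower().split()) - STOPWORDS
    return bool(wanted & found)

if __name__ == "__main__":
    judged = evaluate_edit(None, None, "make the car red", toy_detect, toy_judge)
    for j in judged:
        print(j.difference.description, "->", "coherent" if j.coherent else "not coherent")
    print("coherent fraction:", coherent_fraction(judged))
```

Keeping the per-difference verdicts, rather than only a single number, is what would make such an evaluation explainable: each verdict points at a region and says whether it was requested.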

Findings

01. DICE accurately detects relevant image differences.

02. DICE's evaluations strongly correlate with human judgments.

03. The framework outperforms existing metrics in assessing image edits.
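Agreement between an automatic metric and human judgments is commonly quantified with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python as one possible way to do this; this page does not state which correlation measure the paper reports, so the choice of Spearman and the function names here are assumptions, and no data from the paper is used.

```python
import math
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("inputs must have the same length")
    if len(metric_scores) < 2:
        raise ValueError("need at least two paired scores")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 means the metric orders edits the same way human raters do, and a value near 0 means the two orderings are unrelated.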

Abstract

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed…

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Digital Storytelling and Education

Methods: Inpainting