DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Yuancheng Wei; Haojie Zhang; Linli Yao; Lei Li; Jiali Chen; Tao Huang; Yiting Lu; Duojun Huang; Xin Li; Zhao Zhong

arXiv:2605.04503·cs.CV·May 7, 2026

TL;DR

DiffCap-Bench introduces a diverse, challenging benchmark for image difference captioning, emphasizing semantic accuracy and reasoning, with an evaluation protocol that aligns well with human judgments and downstream tasks.

Contribution

This work presents a comprehensive IDC benchmark with an LLM-based evaluation protocol, addressing three limitations of prior benchmarks: limited diversity, limited compositional complexity, and reliance on lexical-overlap metrics for evaluation.

Findings

01

Significant performance gaps between proprietary and open-source models.

02

Reasoning capability is critical for accurate difference captioning.

03

Strong correlation between benchmark results and downstream image editing quality.

Abstract

Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to…
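
To illustrate the abstract's point about lexical-overlap metrics, here is a minimal sketch of clipped unigram precision, the n=1 building block of BLEU. It is not the paper's evaluation code; the function name and the three example captions are made up for illustration. Under these assumptions, a caption that gets the change backwards ("added" instead of "removed") outscores a correct paraphrase, because it shares more surface tokens with the reference.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (the n=1 component of BLEU).

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split, which real BLEU implementations refine.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs
    # in the reference ("clipping").
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


# Made-up captions, not taken from DiffCap-Bench.
reference = "the red car on the left was removed"
wrong = "the red car on the left was added"  # semantically opposite
paraphrase = "someone took away the crimson vehicle at left"  # correct meaning

score_wrong = unigram_precision(wrong, reference)
score_paraphrase = unigram_precision(paraphrase, reference)

print(f"wrong caption:      {score_wrong:.3f}")
print(f"correct paraphrase: {score_paraphrase:.3f}")
```

The incorrect caption matches 7 of its 8 tokens, while the correct paraphrase matches only 2 of 8, so an overlap-only metric ranks them in the wrong order.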
