MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

Shichao Kan; Xuyang Zhang; Haojie Zhang; Zhe Zhu; Yigang Cen; Yixiong Liang; Lianlei Shan; Linna Zhang; Zhe Qu; Jiazhi Xia

arXiv:2605.06080·cs.CV·May 8, 2026

TL;DR

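MSD-Score is a reference-free image caption evaluation metric that compares image patch and text token embeddings as distributions at multiple scales, so that it can detect fine-grained mismatches (hallucinated objects, missing attributes, incorrect relations) and align more closely with human judgments.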

Contribution

The paper introduces a multi-scale distributional scoring framework based on von Mises-Fisher mixtures, which the authors report improves accuracy and diagnostics over existing reference-free metrics.
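
For readers unfamiliar with the distribution family named above, the standard von Mises-Fisher density on the unit hypersphere and a mixture of such components can be written as follows. The symbols $d$, $K$, $\pi_k$, $\mu_k$, and $\kappa_k$ are notation assumed here for illustration; the paper's own notation may differ.

```latex
% Standard von Mises-Fisher (vMF) density for a unit vector x in R^d,
% with mean direction mu (unit norm) and concentration kappa > 0.
f_d(x;\mu,\kappa) = C_d(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
\qquad \|x\| = \|\mu\| = 1 .

% A K-component vMF mixture (assumed notation): nonnegative weights pi_k sum to one.
p(x) = \sum_{k=1}^{K} \pi_k\, f_d(x;\mu_k,\kappa_k),
\qquad \sum_{k=1}^{K} \pi_k = 1 .
```

Here $I_{\nu}$ is the modified Bessel function of the first kind. A larger $\kappa$ concentrates mass more tightly around the mean direction $\mu$.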

Findings

01

MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics.

02

It provides transparent diagnostics of local grounding errors.

03

The probabilistic formulation offers a deterministic complement to holistic metrics.

Abstract

Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent…
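
To make the idea of a weighted bi-directional KL divergence between von Mises-Fisher distributions concrete, the sketch below computes it in closed form for two single vMF components on the ordinary sphere in three dimensions. This is an illustration under stated assumptions, not the paper's method: the function names, the `weight` parameter, and the restriction to single components in 3-D are all hypothetical choices made here. The KL divergence between two mixtures has no closed form in general, and the abstract does not say how MSD-Score handles that case.

```python
import math


def log_norm_const_3d(kappa):
    """log C_3(kappa) for a vMF on the unit sphere in R^3.

    In three dimensions C_3(kappa) = kappa / (4 * pi * sinh(kappa)).
    Requires kappa > 0.
    """
    # log(sinh(k)) = k + log(1 - exp(-2k)) - log(2), stable for large k.
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    return math.log(kappa) - math.log(4.0 * math.pi) - log_sinh


def mean_resultant_length_3d(kappa):
    """A_3(kappa) = coth(kappa) - 1/kappa, so that E[x] = A_3(kappa) * mu."""
    return 1.0 / math.tanh(kappa) - 1.0 / kappa


def kl_vmf_3d(mu_p, kappa_p, mu_q, kappa_q):
    """Closed-form KL(p || q) between two 3-D vMF distributions.

    KL = log C(kappa_p) - log C(kappa_q)
         + A(kappa_p) * (kappa_p - kappa_q * <mu_p, mu_q>)
    mu_p and mu_q are assumed to be unit-norm sequences of length 3.
    """
    dot = sum(a * b for a, b in zip(mu_p, mu_q))
    return (
        log_norm_const_3d(kappa_p)
        - log_norm_const_3d(kappa_q)
        + mean_resultant_length_3d(kappa_p) * (kappa_p - kappa_q * dot)
    )


def weighted_bidirectional_kl(mu_p, kappa_p, mu_q, kappa_q, weight=0.5):
    """Hypothetical weighting: weight * KL(p||q) + (1 - weight) * KL(q||p)."""
    forward = kl_vmf_3d(mu_p, kappa_p, mu_q, kappa_q)
    backward = kl_vmf_3d(mu_q, kappa_q, mu_p, kappa_p)
    return weight * forward + (1.0 - weight) * backward


if __name__ == "__main__":
    image_dir = (1.0, 0.0, 0.0)   # toy "image patch" mean direction
    text_dir = (0.0, 1.0, 0.0)    # toy "text token" mean direction
    print(weighted_bidirectional_kl(image_dir, 10.0, image_dir, 10.0))
    print(weighted_bidirectional_kl(image_dir, 10.0, text_dir, 10.0))
```

Identical components give a divergence of zero, and the value grows as the mean directions separate or the concentrations differ. Real patch and token embeddings are high-dimensional, where the normalizer involves modified Bessel functions rather than `sinh`, so a practical version would need a numerically stable Bessel routine.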
