Surprisal reveals diversity gaps in image captioning and different scorers change the story

Nikolai Ilinykh; Simon Dobnik

arXiv:2511.04754·cs.CL·November 10, 2025

Open Access

TL;DR

This paper introduces a surprisal-based metric to quantify linguistic diversity in image captioning and shows that the choice of scoring language model can reverse conclusions about how diverse model captions are relative to human captions.

Contribution

It proposes a new diversity metric based on surprisal variance and demonstrates the importance of using multiple scorers for robust evaluation.

Findings

01

Measured with a caption-trained n-gram LM, human captions show roughly twice the surprisal variance of model captions on the MSCOCO test set.

02

Rescoring captions with a general-language model can invert diversity assessments.

03

Different evaluators can lead to contrasting conclusions about captioning diversity.

Abstract

We quantify linguistic diversity in image captioning with surprisal variance: the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
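The definition of surprisal variance in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, an add-one-smoothed bigram model stands in for the caption-trained n-gram LM, surprisal is measured in bits, and variance is taken as the population variance over all tokens pooled across the caption set.

```python
import math
from collections import Counter
from statistics import pvariance


def train_bigram_lm(corpus):
    """Return prob(prev, tok): an add-one-smoothed bigram probability.

    Hypothetical stand-in for a caption-trained n-gram scorer.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, tok in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, tok)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen token types

    def prob(prev, tok):
        return (bigram_counts[(prev, tok)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def token_surprisals(captions, prob):
    """Token-level surprisal, -log2 P(token | previous), pooled over a caption set."""
    surprisals = []
    for caption in captions:
        tokens = ["<s>"] + caption.lower().split() + ["</s>"]
        for prev, tok in zip(tokens, tokens[1:]):
            surprisals.append(-math.log2(prob(prev, tok)))
    return surprisals


def surprisal_variance(captions, prob):
    """Population variance of the pooled token surprisals (an assumed definition)."""
    return pvariance(token_surprisals(captions, prob))
```

Because `surprisal_variance` accepts any `prob` callable, the same caption set can be rescored under a second scorer (for example, one trained on general-language text rather than captions) and the two variances reported side by side, which is the practice the abstract recommends.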

Taxonomy

Topics: Multimodal Machine Learning Applications · Text Readability and Simplification · Natural Language Processing Techniques