Surprisal reveals diversity gaps in image captioning and different scorers change the story
Nikolai Ilinykh, Simon Dobnik

TL;DR
This paper introduces a surprisal-based metric to quantify linguistic diversity in image captioning and shows that the choice of scoring language model can reverse conclusions about how diverse model captions are relative to human captions.
Contribution
It proposes a new diversity metric based on surprisal variance and demonstrates the importance of using multiple scorers for robust evaluation.
Findings
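Measured with a caption-trained n-gram LM, human captions on the MSCOCO test set show roughly twice the surprisal variance of model captions.
Rescoring the same captions with a general-language model reverses this pattern.
Relying on a single scorer can therefore invert conclusions about captioning diversity, so surprisal should be reported under several scorers.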
Abstract
We quantify linguistic diversity in image captioning with surprisal variance, the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
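The abstract's definition can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy add-alpha smoothed unigram scorer in place of the caption-trained n-gram LM, whitespace tokenization, base-2 logarithms, and population variance pooled over all tokens in the caption set. The names `train_unigram` and `surprisal_variance` are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def train_unigram(corpus, alpha=1.0):
    """Build a toy scorer: token -> add-alpha smoothed log2-probability.

    Assumption: a unigram model stands in for the paper's n-gram LM.
    """
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log2((counts[token] + alpha) / (total + alpha * vocab))

    return logprob


def surprisal_variance(captions, logprob):
    """Variance of token-level surprisal (-log2 p) pooled over a caption set.

    Assumption: tokens from all captions are pooled before taking the variance.
    """
    surprisals = [-logprob(tok) for cap in captions for tok in cap.split()]
    return pvariance(surprisals)


# Toy usage: score two caption sets under the same scorer.
scorer = train_unigram(["a dog runs", "a cat sits", "a dog sits"])
repetitive = surprisal_variance(["a a a"], scorer)
varied = surprisal_variance(["a dog", "a zebra"], scorer)
```

Because `surprisal_variance` takes the scorer as an argument, the same caption set can be rescored under a second `logprob` function (for example, one trained on general-domain text) and the two variances compared, which is the kind of multi-scorer reporting the abstract argues for.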
Taxonomy
Topics: Multimodal Machine Learning Applications · Text Readability and Simplification · Natural Language Processing Techniques
