ContextRef: Evaluating Referenceless Metrics For Image Description Generation
Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber

TL;DR
This paper introduces ContextRef, a benchmark for evaluating referenceless image description metrics, revealing current models' limitations and the importance of context in aligning with human judgments.
Contribution
The paper presents ContextRef, a new benchmark with human ratings and robustness checks to evaluate and improve referenceless image description metrics.
Findings
Current models fail to perform well on ContextRef
Fine-tuning improves model performance significantly
Context dependence remains a major challenge
Abstract
Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we…
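To make the idea of a referenceless metric concrete, here is a minimal sketch of a CLIPScore-style scoring function. It assumes the image and description embeddings have already been produced by a pretrained vision-language model; the toy vectors are hypothetical stand-ins for those embeddings. The 2.5 rescaling weight follows the original CLIPScore definition. This is not one of the context-aware variants evaluated on ContextRef.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clipscore(image_emb, text_emb, weight=2.5):
    """CLIPScore-style referenceless score: weight * max(cos(image, text), 0).

    No reference description is involved; the score depends only on how
    close the image and candidate-description embeddings are.
    """
    return weight * max(cosine_similarity(image_emb, text_emb), 0.0)


# Hypothetical toy embeddings (assumption: real ones come from a CLIP-like model).
image_emb = [1.0, 0.0]
matching_text_emb = [1.0, 0.0]
unrelated_text_emb = [0.0, 1.0]

print(clipscore(image_emb, matching_text_emb))   # identical direction
print(clipscore(image_emb, unrelated_text_emb))  # orthogonal direction
```

A score like this sees only the image and the description, which is why a benchmark that presents both in context can expose cases where the metric and human raters disagree.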
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: None · ALIGN
