LLM-Free Image Captioning Evaluation in Reference-Flexible Settings
Shinnosuke Hirano, Yuiga Wada, Kazuki Matsuda, Seitaro Otsuki, Komei Sugiura

TL;DR
This paper introduces Pearl, an LLM-free supervised metric for image captioning evaluation that works in both reference-based and reference-free settings, outperforming existing LLM-free metrics and avoiding the neutrality issues of LLM-based metrics.
Contribution
Pearl is a novel LLM-free supervised metric that learns similarity representations for image-caption and caption-caption comparisons, making it applicable in both reference-based and reference-free evaluation settings.
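To illustrate how a single metric can serve both settings, here is a minimal sketch. It assumes caption and image embeddings are already available as plain vectors, and it uses fixed cosine similarities combined with a harmonic mean, in the style of RefCLIPScore. This is NOT Pearl's learned mechanism. The function names `cosine` and `caption_score` are hypothetical.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def caption_score(
    image_emb: Vector,
    cand_emb: Vector,
    ref_embs: Optional[Sequence[Vector]] = None,
) -> float:
    """Hypothetical scorer usable with or without reference captions.

    Reference-free: image-caption similarity only.
    Reference-based: harmonic mean of the image-caption similarity and the
    best caption-caption similarity against the references (a RefCLIPScore-style
    combination, assumed here for illustration only).
    """
    img_sim = max(cosine(image_emb, cand_emb), 0.0)  # clip negative similarity to 0
    if not ref_embs:
        return img_sim
    ref_sim = max(max(cosine(cand_emb, r) for r in ref_embs), 0.0)
    if img_sim + ref_sim == 0.0:
        return 0.0
    return 2 * img_sim * ref_sim / (img_sim + ref_sim)


if __name__ == "__main__":
    image, cand, refs = [1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [[1.0, 1.0, 0.0]]
    print(round(caption_score(image, cand), 4))        # reference-free
    print(round(caption_score(image, cand, refs), 4))  # reference-based
```

The same entry point degrades gracefully when no references are given, which is the practical property a reference-flexible metric needs. A supervised metric such as Pearl would replace these fixed similarities with learned representations.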
Findings
Pearl outperforms existing LLM-free metrics on multiple datasets.
A human-annotated dataset of approximately 333k judgments, collected from 2,360 annotators across over 75k images, was constructed for evaluating image captioning metrics.
Pearl maintains neutrality and high performance in both reference-based and reference-free evaluations.
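Captioning metrics are commonly compared by how well their scores correlate with human judgments, often using Kendall's tau. The summary does not state which correlation measure underlies the results above, so treat that as an assumption. The sketch below implements the simplest variant, tau-a, which counts concordant and discordant pairs without a tie correction. Published benchmarks often report the tau-b or tau-c variants instead. The function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric: Sequence[float], human: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant, so this variant
    does not correct for ties (tau-b and tau-c do).
    """
    if len(metric) != len(human) or len(metric) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(metric)), 2):
        # Same sign means the metric and the humans order this pair the same way.
        s = (metric[i] - metric[j]) * (human[i] - human[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(metric) * (len(metric) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    scores = [0.9, 0.4, 0.7, 0.1]  # hypothetical metric scores for four captions
    print(kendall_tau_a(scores, [4, 2, 3, 1]))  # perfect agreement with human ratings
    print(kendall_tau_a(scores, [4, 2, 1, 3]))  # as many agreeing as disagreeing pairs
```

A higher tau means the metric ranks captions more like human annotators do. This pairwise ranking view is why large pools of human judgments are valuable for training and evaluating such metrics.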
Abstract
We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations, which calls their neutrality into question. Most LLM-free metrics do not suffer from this issue, but they do not always achieve high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning that is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns representations of image-caption and caption-caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification
