LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

Shinnosuke Hirano; Yuiga Wada; Kazuki Matsuda; Seitaro Otsuki; Komei Sugiura

arXiv:2512.21582·cs.CV·December 29, 2025

TL;DR

This paper introduces Pearl, an LLM-free supervised metric for image captioning evaluation that works in both reference-based and reference-free settings, outperforming existing metrics and addressing neutrality issues of LLM-based metrics.

Contribution

Pearl is a novel LLM-free supervised metric that learns similarity representations for image-caption and caption-caption comparisons, applicable in multiple evaluation settings.
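To make the idea of one metric serving both settings concrete, here is a minimal sketch. It is not Pearl's learned mechanism: the embedding vectors, the `flexible_score` name, the max-over-references rule, and the fixed 0.5 weighting are all illustrative assumptions, whereas Pearl learns its similarity representations from human judgments.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flexible_score(
    image_vec: Vector,
    caption_vec: Vector,
    reference_vecs: Optional[Sequence[Vector]] = None,
) -> float:
    """Hypothetical score that works with or without reference captions.

    Assumption: embeddings come from some external encoder; the combination
    rule below is hand-written for illustration, not learned.
    """
    # Image-caption similarity is always available.
    image_sim = cosine(image_vec, caption_vec)
    if not reference_vecs:
        # Reference-free setting: rely on the image alone.
        return image_sim
    # Reference-based setting: also use the closest reference caption.
    ref_sim = max(cosine(caption_vec, ref) for ref in reference_vecs)
    return 0.5 * (image_sim + ref_sim)
```

Calling `flexible_score(image_vec, caption_vec)` covers the reference-free case, and passing a list of reference embeddings as the third argument switches to the reference-based case without changing the interface.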

Findings

01 Pearl outperforms existing LLM-free metrics on multiple datasets.

02 A large human-annotated dataset with approximately 333k judgments was created for evaluation.

03 Pearl maintains neutrality and high performance in both reference-based and reference-free evaluations.
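Claims that one captioning metric outperforms another are commonly backed by rank correlation between metric scores and human judgments. The sketch below computes Kendall's tau-a in plain Python. The choice of tau-a, the function name, and the toy scores are assumptions for illustration; this page does not state which correlation coefficient the paper reports.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items to rank")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A positive product means both scorers order the pair the same way.
        product = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy example (made-up numbers): the metric ranks three captions
# in the same order as the human ratings.
toy_metric = [0.1, 0.4, 0.8]
toy_human = [1, 2, 3]
tau = kendall_tau_a(toy_metric, toy_human)
```

A value near 1 means the metric orders captions the way annotators do, a value near -1 means it reverses their order, and a value near 0 means no consistent relationship.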

Abstract

We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations, which calls their neutrality into question. Most LLM-free metrics do not suffer from this issue, but they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning that is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image-caption and caption-caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification