FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

Yebin Lee, Imseong Park, and Myungjoo Kang

arXiv:2406.06004·cs.CV·June 11, 2024

TL;DR

FLEUR is an explainable, reference-free image captioning evaluation metric leveraging a large multimodal model, providing scores and explanations aligned with human judgment without needing reference captions.

Contribution

This paper introduces FLEUR, a novel explainable, reference-free evaluation metric for image captioning using a large multimodal model, enhancing interpretability and reducing reliance on reference captions.

Findings

1. FLEUR achieves high correlation with human judgment across benchmarks.

2. FLEUR outperforms existing reference-free evaluation metrics.

3. FLEUR provides explanations for evaluation scores.

Abstract

Most existing image captioning evaluation metrics focus on assigning a single numerical score to a caption by comparing it with reference captions. However, these methods do not provide an explanation for the assigned score. Moreover, reference captions are expensive to acquire. In this paper, we propose FLEUR, an explainable reference-free metric to introduce explainability into image captioning evaluation metrics. By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions, and provide the explanation for the assigned score. We introduce score smoothing to align as closely as possible with human judgment and to be robust to user-defined grading criteria. FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks and reaches state-of-the-art results on Flickr8k-CF,…
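The abstract names score smoothing without describing it. One plausible reading, offered here as an assumption and not as the authors' confirmed method, is that the model's single decoded score is replaced by a probability-weighted average over the digit tokens the model could have produced. The function names, the two-decimal score format, and the example probabilities below are all hypothetical.

```python
from typing import Mapping


def smoothed_digit(probs: Mapping[str, float]) -> float:
    """Expected digit value from token probabilities over '0'..'9'.

    Assumption: `probs` maps digit strings to the model's probability for
    that token at one score position. Non-digit tokens are ignored and the
    remaining mass is renormalized.
    """
    total = sum(probs.get(str(d), 0.0) for d in range(10))
    if total <= 0.0:
        raise ValueError("no probability mass on digit tokens")
    return sum(d * probs.get(str(d), 0.0) for d in range(10)) / total


def smoothed_score(tenths: Mapping[str, float],
                   hundredths: Mapping[str, float]) -> float:
    """Combine two smoothed digits into a score of the form 0.xy.

    Assumption: the grader is prompted to answer with a score between
    0.00 and 0.99, so only the two decimal positions are smoothed.
    """
    return 0.1 * smoothed_digit(tenths) + 0.01 * smoothed_digit(hundredths)


if __name__ == "__main__":
    # Hypothetical token probabilities for the two decimal positions.
    tenths = {"7": 0.6, "8": 0.4}
    hundredths = {"5": 1.0}
    print(round(smoothed_score(tenths, hundredths), 4))
```

Under this reading, a model that is split between two adjacent grades yields an intermediate score instead of an arbitrary pick of one grade, which is one way a metric could track graded human judgment more closely.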

Code & Models

Repositories

yebin46/fleur · Official

Videos

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Focus · ALIGN