Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?

Gonçalo Gomes; Chrysoula Zerva; Bruno Martins

arXiv:2502.06600 · cs.CL · February 18, 2025

TL;DR

This paper evaluates the effectiveness of CLIP-based metrics for multilingual image captioning, demonstrating their strong correlation with human judgments across various languages and datasets.

Contribution

It introduces strategies for evaluating multilingual captioning with CLIPScore variants and shows their robustness across languages and complex linguistic challenges.
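
For readers unfamiliar with the metric, here is a minimal sketch of how a CLIPScore-style value is computed from an image embedding and a caption embedding. It follows the definition from the original CLIPScore work (a rescaled cosine similarity clipped at zero, with weight w = 2.5); this summary does not restate the formula, so treat that as background knowledge and not as a description of the variants studied here. The vectors below are hypothetical toy values; in practice they would come from a (multilingual) CLIP image encoder and text encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


# Hypothetical toy embeddings, standing in for encoder outputs.
image_embedding = [0.2, 0.7, 0.1]
good_caption = [0.25, 0.65, 0.05]
bad_caption = [-0.6, 0.1, 0.9]

print(clip_score(image_embedding, good_caption))
print(clip_score(image_embedding, bad_caption))
```

Clipping at zero keeps the score non-negative, so a caption that points away from the image in embedding space is scored as 0 instead of receiving a negative value.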

Findings

01. Multilingual CLIPScore models correlate well with human judgments.

02. Finetuned multilingual models generalize effectively across languages.

03. Machine-translated data supports high-quality multilingual caption evaluation.
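
The first finding rests on correlating metric scores with human ratings. A minimal sketch of one way to measure that agreement is a rank correlation such as Kendall's tau. The tau-a variant below (no tie correction) is an assumption chosen for simplicity; this summary does not say which coefficient the authors report. The scores and ratings are hypothetical illustration values, not data from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total in tau-a.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical metric scores and human ratings for five captions.
metric_scores = [0.81, 0.42, 0.67, 0.15, 0.90]
human_ratings = [4, 2, 3, 1, 5]

print(kendall_tau_a(metric_scores, human_ratings))
```

A value near 1 means the metric ranks captions in almost the same order as the annotators; a value near 0 means the two rankings are unrelated.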

Abstract

The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a…

Code & Models

Repositories

gecgomes/Multilingual_IC_Eval (Official)

Videos

Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? (Underline)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Attentive Walk-Aggregating Graph Neural Network