Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?
Gonçalo Gomes, Chrysoula Zerva, Bruno Martins

TL;DR
This paper evaluates the effectiveness of CLIP-based metrics for multilingual image captioning, demonstrating their strong correlation with human judgments across various languages and datasets.
Contribution
It introduces strategies for evaluating multilingual captioning with CLIPScore variants and shows their robustness across languages and complex linguistic challenges.
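For readers unfamiliar with the base metric, the following is a minimal sketch of reference-free CLIPScore as defined in the original CLIPScore work: a rescaled, clipped cosine similarity between an image embedding and a caption embedding, with weight w = 2.5. The embeddings are assumed to have been produced already by some CLIP-style encoder (a multilingual one in this paper's setting). The function names are hypothetical, and the variants evaluated in the paper may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0).

    image_emb and caption_emb are assumed to be precomputed embeddings
    from the same CLIP-style model; w = 2.5 follows the original metric.
    """
    return w * max(cosine(image_emb, caption_emb), 0.0)
```

Negative similarities are clipped to zero, so an unrelated or contradictory caption scores 0 and a perfectly aligned one scores w.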
Findings
Multilingual CLIPScore models correlate well with human judgments.
Finetuned multilingual models generalize effectively across languages.
Machine-translated data supports high-quality multilingual caption evaluation.
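"Correlates well with human judgments" is typically quantified with a rank correlation between metric scores and human ratings for the same captions. The sketch below computes Kendall's tau-a in plain Python. This summary does not state which coefficient the paper uses, so the choice of tau-a and the function name are assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a between two equal-length lists of scores.

    Counts concordant and discordant pairs; tied pairs count as neither.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items to rank")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs
```

A value near 1 means the metric orders captions the way human raters do, a value near -1 means it orders them the opposite way, and a value near 0 means there is no consistent relationship.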
Abstract
The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
