Towards Annotation-Free Evaluation of Cross-Lingual Image Captioning
Aozhu Chen, Xinyi Huang, Hailan Lin, Xirong Li

TL;DR
This paper introduces metrics for evaluating cross-lingual image captioning that need no human-written references in the target language, covering scenarios both with and without English references.
Contribution
It proposes three evaluation metrics, WMDRel, CLinRel, and CMedRel, for assessing cross-lingual image captions without requiring target-language references. WMDRel and CLinRel rely on English references; CMedRel uses no references at all.
Findings
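WMDRel measures the semantic relevance between a model-generated caption and a machine translation of an English reference, using their Word Mover's Distance.
CLinRel measures visual-oriented cross-lingual relevance by projecting both captions into a deep visual feature space.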
CMedRel enables zero-reference evaluation based on image content.
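To make the distance behind WMDRel more concrete, here is a minimal sketch of a relaxed Word Mover's Distance between two tokenized sentences. It is not the authors' implementation: it swaps the exact optimal-transport problem for the simpler relaxed lower bound (each word moves entirely to its nearest word in the other sentence), and the 2-D vectors in `TOY_VECTORS` and every function name are hypothetical, invented for illustration. In practice both inputs would be target-language sentences (the generated caption and the machine-translated reference) with pretrained word embeddings. How the paper turns the distance into a relevance score is not stated in this summary, so the sketch stops at the distance.

```python
import math
from collections import Counter

# Toy 2-D word vectors, invented purely for illustration (an assumption).
TOY_VECTORS = {
    "dog": (1.0, 0.0),
    "puppy": (0.9, 0.1),
    "runs": (0.0, 1.0),
    "sprints": (0.1, 0.9),
    "cat": (0.5, 0.5),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(tokens):
    """Normalized bag-of-words weights: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src_tokens, dst_tokens, vectors):
    """Cost of moving each source word entirely to its nearest target word."""
    dst_words = set(dst_tokens)
    return sum(
        weight * min(euclidean(vectors[word], vectors[d]) for d in dst_words)
        for word, weight in nbow(src_tokens).items()
    )


def relaxed_wmd(tokens_a, tokens_b, vectors=TOY_VECTORS):
    """Relaxed WMD: the larger of the two one-sided costs.

    This is a lower bound on the exact Word Mover's Distance, not the
    exact optimal-transport solution.
    """
    return max(
        one_sided_cost(tokens_a, tokens_b, vectors),
        one_sided_cost(tokens_b, tokens_a, vectors),
    )


if __name__ == "__main__":
    generated = ["dog", "runs"]
    translated_reference = ["puppy", "sprints"]
    unrelated = ["cat"]
    print(relaxed_wmd(generated, translated_reference))
    print(relaxed_wmd(generated, unrelated))
```

With these toy vectors, the caption sits closer to the paraphrased reference than to the unrelated sentence, which is the behavior a distance-based relevance metric depends on.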
Abstract
Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. In order to save the precious human resource from re-writing reference sentences per target language, in this paper we make a brave attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario with the references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visual-oriented cross-lingual relevance measure. As for the second scenario, which…
