Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning

Nathaniel Krasner; Nicholas Lanuzo; Antonios Anastasopoulos

arXiv:2505.13628 · cs.CL · May 21, 2025

TL;DR

This paper investigates whether visual information from image-caption datasets can align multilingual sentence representations in place of bitexts. Because such datasets can be created without multilingual expertise, the approach offers a more efficient alternative for low-resource languages, and the aligned representations prove usable for cross-lingual NLU and bitext retrieval.

Contribution

It introduces a contrastive image-caption tuning method that aligns multilingual text representations through shared visual data, supporting cross-lingual NLU and bitext retrieval without relying on bitexts for alignment.
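
The idea of contrastive image-caption tuning can be illustrated with a small sketch. This page does not state the paper's exact objective, so the code below assumes a CLIP-style symmetric InfoNCE loss; the function names, the temperature value, and the toy embeddings are all hypothetical. Each caption is pulled toward its own image and pushed away from the other images in the batch. If captions in different languages are paired with the same images, tuning them toward a shared visual anchor is what would align them with each other implicitly.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: a CLIP-style objective; the paper's actual loss may differ.
    Row i of image_embs is paired with row i of caption_embs.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    caps = [l2_normalize(v) for v in caption_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and caption j
    logits = [[sum(a * b for a, b in zip(imgs[i], caps[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-caption direction: the matching caption sits on the diagonal.
    i2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Caption-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    c2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return (i2c + c2i) / 2

# Toy 2-D embeddings (hypothetical): correctly paired vs. swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[0.9, 0.1], [0.1, 0.9]]
captions_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = contrastive_loss(images, captions_matched)
loss_swapped = contrastive_loss(images, captions_swapped)
```

On these toy inputs the matched pairing yields a much lower loss than the swapped one, which is the signal a tuning procedure of this kind would follow.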

Findings

01 Multilingual image-caption alignment implicitly aligns text representations across languages.

02 Languages unseen by the encoder during pretraining can be incorporated into the alignment post-hoc.

03 The aligned representations are usable for cross-lingual NLU and bitext retrieval.

Abstract

Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
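
To make the bitext retrieval use case concrete, here is a minimal sketch of one common way such an evaluation can be set up: each source sentence embedding retrieves its nearest target-language embedding by cosine similarity, and accuracy counts how often that neighbor is the gold translation. The abstract does not describe the paper's evaluation protocol, so this setup, the helper names, and the toy vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(source_embs, target_embs):
    """For each source embedding, return the index of the most similar target."""
    return [max(range(len(target_embs)), key=lambda j: cosine(s, target_embs[j]))
            for s in source_embs]

def retrieval_accuracy(source_embs, target_embs):
    """Fraction of sources whose nearest target is the same-index translation.

    Assumption: source_embs[i] and target_embs[i] embed a translation pair.
    """
    preds = retrieve(source_embs, target_embs)
    return sum(p == i for i, p in enumerate(preds)) / len(preds)

# Toy sentence embeddings for two languages (hypothetical values).
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.7]]
accuracy = retrieval_accuracy(source, target)
```

If the two languages' representations are well aligned, translation pairs land close together and this accuracy is high; poorly aligned encoders would scatter the pairs and lower it.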

Code & Models

Repositories

nkrasner/cl-clip-align (PyTorch, official)

Videos

Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: ALIGN