Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

Barah Fazili; Koustava Goswami

arXiv:2602.21543 · cs.CL · February 26, 2026

TL;DR

This paper shows that contrastive training on a multi-way parallel corpus substantially improves the multilingual and cross-lingual representations of XLM-RoBERTa and multilingual BERT base models, with gains on several MTEB natural language understanding tasks in both seen and unseen languages.

Contribution

The authors use multi-way parallel text (English sentences plus NMT translations into six target languages) for contrastive training to improve cross-lingual alignment in pretrained models, and report that this surpasses traditional bilingual methods.

Findings

01. 21.3% improvement in bitext mining

02. 5.3% gain in semantic similarity

03. 28.4% increase in classification accuracy

Abstract

Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%),…
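The abstract does not spell out the training objective, so the following is only a minimal sketch of how contrastive alignment over a multi-way parallel batch could look. It assumes a symmetric InfoNCE loss with in-batch negatives, averaged over every pair of languages. The function names, the temperature value, and the (languages, batch, dim) array layout are illustrative assumptions, not details taken from the paper, and random vectors stand in for encoder outputs.

```python
import itertools

import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE between two aligned batches of shape (B, D).

    Row i of `a` and row i of `b` are translations of the same sentence
    (the positive pair); all other rows in the batch act as negatives.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def multiway_contrastive_loss(embeddings, temperature=0.05):
    """Average the pairwise loss over all language pairs.

    embeddings: array of shape (L, B, D), one (B, D) batch per language,
    with the same sentence order in every language (assumed layout).
    """
    pairs = itertools.combinations(range(embeddings.shape[0]), 2)
    return float(np.mean([info_nce(embeddings[i], embeddings[j], temperature)
                          for i, j in pairs]))


# Toy usage: 7 views (English source + six translations), 8 sentences, 16 dims.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
aligned = np.stack([base + 0.01 * rng.normal(size=base.shape) for _ in range(7)])
unaligned = rng.normal(size=(7, 8, 16))
print("aligned loss:  ", multiway_contrastive_loss(aligned))
print("unaligned loss:", multiway_contrastive_loss(unaligned))
```

A bilingual setup would correspond to calling `info_nce` on English and a single target language; the multi-way variant sketched here also pulls the non-English translations toward one another, which is one plausible reading of why the multi-way corpus helps.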

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks