English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings
Yau-Shian Wang, Ashley Wu, Graham Neubig

TL;DR
This paper introduces mSimCSE, a contrastive learning method that learns high-quality universal cross-lingual sentence embeddings from English data alone, with no parallel data, and that transfers well to multiple languages.
Contribution
The paper proposes mSimCSE, which extends SimCSE to multilingual settings, and demonstrates that contrastive learning on English data can produce effective cross-lingual embeddings without parallel data.
Findings
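Unsupervised mSimCSE performs comparably to fully supervised methods on retrieval for low-resource languages and on multilingual STS.
In unsupervised and weakly supervised settings, mSimCSE significantly improves over previous sentence embedding methods on cross-lingual retrieval and multilingual STS.
Performance improves further when cross-lingual NLI data is available.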
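To make the cross-lingual retrieval setting in the findings concrete, here is a minimal sketch of how such an evaluation is commonly scored: each query sentence embedding in one language retrieves its nearest neighbour, by cosine similarity, among candidate embeddings in another language. The toy vectors, the function names `retrieve` and `retrieval_accuracy`, and the assumption that `queries[i]` is a translation of `candidates[i]` are all illustrative and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(queries, candidates):
    """For each query, return the index of the most similar candidate."""
    return [
        max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        for q in queries
    ]


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate has the same index.

    Assumption: queries[i] and candidates[i] are translations of each other.
    """
    predictions = retrieve(queries, candidates)
    hits = sum(1 for i, p in enumerate(predictions) if p == i)
    return hits / len(queries)


# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
candidates = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
accuracy = retrieval_accuracy(queries, candidates)
```

In a real evaluation the vectors would come from the trained sentence encoder; the scoring logic stays the same.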
Abstract
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveals that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves over previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at…
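The contrastive objective the abstract refers to can be sketched as an InfoNCE-style loss with in-batch negatives, in the spirit of SimCSE: each sentence embedding is pulled towards its positive view and pushed away from the positives of the other sentences in the batch. This is a minimal illustration rather than the authors' implementation. The function name `info_nce_loss`, the toy vectors, and the temperature of 0.05 are assumptions made for the example; a real setup would obtain the embeddings from a multilingual encoder and compute the loss in a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] are two embeddings of the same sentence;
    positives[j] for j != i act as negatives for anchors[i].
    The temperature value is an assumed hyperparameter.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy against the matching index i.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical 2-dimensional embeddings for a batch of two sentences.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly matched pairs the loss is close to zero, while swapping the positives makes it large, which is the signal that drives the embeddings of similar sentences together during training.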
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: SimCSE · Contrastive Learning
