TL;DR
This paper critically evaluates fully unsupervised cross-lingual word embedding (CLWE) methods, revealing their limitations for resource-poor and distant language pairs, and compares their performance to that of weakly supervised approaches.
Contribution
The paper provides a comprehensive empirical analysis showing that, on challenging language pairs, fully unsupervised CLWE methods often fail outright or underperform weakly supervised methods.
Findings
Fully unsupervised CLWE methods yield zero BLI performance for many language pairs.
Weakly supervised methods outperform unsupervised ones in all tested scenarios.
Unsupervised methods do not surpass the performance of weakly supervised approaches seeded with 500-1,000 translation pairs.
Abstract
Recent efforts in cross-lingual word embedding (CLWE) learning have predominantly focused on fully unsupervised approaches that project monolingual embeddings into a shared cross-lingual space without any cross-lingual signal. The lack of any supervision makes such approaches conceptually attractive. Yet, their only core difference from (weakly) supervised projection-based CLWE methods is in the way they obtain a seed dictionary used to initialize an iterative self-learning procedure. The fully unsupervised methods have arguably become more robust, and their primary use case is CLWE induction for pairs of resource-poor and distant languages. In this paper, we question the ability of even the most robust unsupervised CLWE approaches to induce meaningful CLWEs in these more challenging settings. A series of bilingual lexicon induction (BLI) experiments with 15 diverse languages (210…
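To make the setup in the abstract concrete, here is a minimal sketch of a projection-based pipeline: a seed dictionary initializes an orthogonal mapping, a self-learning loop alternates between re-inducing the dictionary and refitting the mapping, and BLI is scored as precision@1. It is an illustration under stated assumptions, not the paper's implementation: the SVD-based orthogonal Procrustes solver, plain cosine nearest-neighbor retrieval, the function names, and the noiseless toy data are all hypothetical choices made for this example.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F (assumed solver, via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def normalize(E):
    """Scale each row to unit length so dot products are cosines."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def induce_dictionary(XW, Y):
    """For each mapped source row, return the index of its nearest target row."""
    sims = normalize(XW) @ normalize(Y).T
    return sims.argmax(axis=1)

def self_learning(X, Y, seed_pairs, n_iter=5):
    """Alternate between fitting W on the current dictionary and re-inducing it."""
    src = np.array([s for s, _ in seed_pairs])
    tgt = np.array([t for _, t in seed_pairs])
    for _ in range(n_iter):
        W = procrustes(X[src], Y[tgt])
        # The induced dictionary covers every source word in the next round.
        tgt = induce_dictionary(X @ W, Y)
        src = np.arange(X.shape[0])
    return W

def precision_at_1(X, Y, W, gold):
    """Fraction of gold (source, target) pairs whose top-1 retrieval is correct."""
    pred = induce_dictionary(X @ W, Y)
    return float(np.mean([pred[s] == t for s, t in gold]))

# Toy data: the "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ Q

gold = [(i, i) for i in range(50)]
seed = gold[:10]  # stands in for a small seed dictionary
W = self_learning(X, Y, seed)
print(precision_at_1(X, Y, W, gold))
```

In this sketch, a fully unsupervised method would differ only in how `seed` is obtained: it would be induced heuristically from the monolingual spaces rather than supplied as translation pairs. Because the toy spaces are exactly isometric, the example says nothing about the distant, resource-poor language pairs the paper studies.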
