TL;DR
This paper examines the Anglocentric bias in cross-lingual word embeddings. It shows that the choice of hub language affects lexicon induction performance, expands the evaluation dictionaries to cover all language pairs, and proposes guidelines for strong multilingual embedding baselines.
Contribution
It challenges the default choice of English as the hub language, expands the evaluation dictionaries to all language pairs through triangulation, creates new dictionaries for under-represented languages, and provides guidelines for strong cross-lingual embedding baselines.
Findings
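Hub language choice significantly affects downstream lexicon induction performance.
Evaluating established methods on the expanded dictionaries, which cover all language pairs, reveals new challenges for the field.
The analysis yields general guidelines for strong cross-lingual embedding baselines.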
Abstract
Most recent work in cross-lingual word embeddings is severely Anglocentric. The vast majority of lexicon induction evaluation dictionaries are between English and another language, and the English embedding space is selected by default as the hub when learning in a multilingual setting. With this work, however, we challenge these practices. First, we show that the choice of hub language can significantly impact downstream lexicon induction performance. Second, we both expand the current evaluation dictionary collection to include all language pairs using triangulation and create new dictionaries for under-represented languages. Evaluating established methods over all these language pairs sheds light on their suitability and presents new challenges for the field. Finally, in our analysis we identify general guidelines for strong cross-lingual embedding baselines, based on…
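The abstract mentions expanding the dictionaries to all language pairs "using triangulation". As a rough illustration of the general idea, and not the authors' actual procedure, the sketch below composes a source-to-pivot dictionary with a pivot-to-target dictionary. The function name `triangulate` and the toy word pairs are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Compose two bilingual dictionaries through a shared pivot language.

    Both arguments are iterables of (word, translation) pairs.
    Returns a dict mapping each source word to a sorted list of
    target words reachable through at least one pivot word.
    """
    # Index the pivot-to-target entries by pivot word for fast lookup.
    pivot_index = defaultdict(set)
    for pivot_word, tgt_word in pivot_to_tgt:
        pivot_index[pivot_word].add(tgt_word)

    # Link every source word to the targets of its pivot translations.
    src_to_tgt = defaultdict(set)
    for src_word, pivot_word in src_to_pivot:
        for tgt_word in pivot_index.get(pivot_word, ()):
            src_to_tgt[src_word].add(tgt_word)

    return {src: sorted(tgts) for src, tgts in src_to_tgt.items()}


# Hypothetical toy entries: Portuguese-English and English-Spanish.
pt_en = [("gato", "cat"), ("cão", "dog"), ("casa", "house")]
en_es = [("cat", "gato"), ("dog", "perro"), ("house", "casa"), ("house", "hogar")]

pt_es = triangulate(pt_en, en_es)
```

This naive composition keeps every path through the pivot, so a polysemous pivot word can link source and target words that are not translations of each other. A practical pipeline would likely need some form of filtering on top of it.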
