Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings
Silvia Severini, Viktor Hangya, Masoud Jalili Sabet, Alexander Fraser, Hinrich Schütze

TL;DR
This paper demonstrates that simple, inexpensive cross-lingual signals like identical words and romanized word matching significantly improve unsupervised bilingual word embeddings, outperforming complex methods especially for distant language pairs.
Contribution
It highlights the importance of cheap, accessible signals in unsupervised BWE construction and shows their effectiveness across diverse language pairs, challenging the focus on complex unsupervised techniques.
Findings
Cheap signals outperform complex unsupervised methods on distant languages.
Identical words and romanized matching are effective seed signals.
Results are competitive with supervised approaches using high-quality lexicons.
Abstract
Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision, leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that…
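The two signals named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy Cyrillic-to-Latin table, and the threshold value `max_dist=1` are all assumptions made for illustration, and a real setup would plug in a proper romanization tool and tune the threshold.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]

# Toy romanization table (assumption): covers only the letters used in the demo.
TOY_TABLE = {"м": "m", "о": "o", "с": "s", "к": "k", "в": "v",
             "а": "a", "л": "l", "н": "n", "д": "d"}


def toy_romanize(word: str) -> str:
    """Hypothetical stand-in for a real romanizer; unknown characters pass through."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in word.lower())


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def identical_pairs(src_vocab: Sequence[str], tgt_vocab: Sequence[str]) -> List[Pair]:
    """Signal 1: strings that occur unchanged in both vocabularies."""
    tgt = set(tgt_vocab)
    return [(w, w) for w in src_vocab if w in tgt]


def romanized_pairs(src_vocab: Sequence[str], tgt_vocab: Sequence[str],
                    romanize: Callable[[str], str], max_dist: int) -> List[Pair]:
    """Signal 2: pairs whose romanized forms are within max_dist edits."""
    tgt_rom = [(t, romanize(t)) for t in tgt_vocab]
    pairs = []
    for s in src_vocab:
        rs = romanize(s)
        for t, rt in tgt_rom:
            # Identical strings are already covered by signal 1.
            if s != t and levenshtein(rs, rt) <= max_dist:
                pairs.append((s, t))
    return pairs


def build_seed_lexicon(src_vocab: Sequence[str], tgt_vocab: Sequence[str],
                       romanize: Callable[[str], str], max_dist: int = 1) -> List[Pair]:
    """Combine both signals, dropping duplicates while keeping order."""
    combined = identical_pairs(src_vocab, tgt_vocab)
    combined += romanized_pairs(src_vocab, tgt_vocab, romanize, max_dist)
    return list(dict.fromkeys(combined))
```

Calling `build_seed_lexicon(["москва", "nato"], ["moskva", "nato"], toy_romanize)` would pair the shared string with itself and the Cyrillic word with its Latin-script counterpart. The quadratic loop in `romanized_pairs` is only practical for small vocabularies; larger ones would need candidate pruning.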
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
