A Strong Baseline for Learning Cross-Lingual Word Embeddings from Sentence Alignments
Omer Levy, Anders Søgaard, Yoav Goldberg

TL;DR
This paper shows that many cross-lingual embedding algorithms perform similarly when using sentence ID features, and suggests that incorporating additional information sources could enhance future models.
Contribution
It provides empirical and theoretical analysis linking embedding and alignment methods, highlighting the importance of sentence ID features and proposing avenues for improvement.
Findings
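Whether an algorithm uses sentence ID features accounts for a significant performance gap
Traditional alignment algorithms such as IBM Model-1 perform comparably to state-of-the-art embedding methods
Information sources beyond bilingual sentence-aligned corpora could improve cross-lingual embeddings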
Abstract
While cross-lingual word embeddings have been studied extensively in recent years, the qualitative differences between the different algorithms remain vague. We observe that whether or not an algorithm uses a particular feature set (sentence IDs) accounts for a significant performance gap among these algorithms. This feature set is also used by traditional alignment algorithms, such as IBM Model-1, which demonstrate similar performance to state-of-the-art embedding algorithms on a variety of benchmarks. Overall, we observe that different algorithmic approaches for utilizing the sentence ID feature space result in similar performance. This paper draws both empirical and theoretical parallels between the embedding and alignment literature, and suggests that adding additional sources of information, which go beyond the traditional signal of bilingual sentence-aligned corpora, may…
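To make the idea of sentence ID features concrete, here is a minimal sketch in Python. It represents each word by the set of aligned sentence indices in which it occurs and then compares words across languages. The toy corpus, the function names, and the choice of the Dice coefficient as the similarity measure are all assumptions made for illustration; this is not presented as the authors' exact method or evaluation setup.

```python
from collections import defaultdict


def sentence_id_features(sentences):
    """Map each word to the set of sentence indices in which it occurs."""
    features = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for word in sentence.split():
            features[word].add(sent_id)
    return features


def dice(a, b):
    """Dice coefficient between two sets of sentence IDs."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical toy parallel corpus: sentence i in one list is aligned
# with sentence i in the other.
english = ["the cat sleeps", "the dog sleeps", "the cat eats"]
french = ["le chat dort", "le chien dort", "le chat mange"]

en = sentence_id_features(english)
fr = sentence_id_features(french)


def best_translation(word):
    """Return the French word whose sentence-ID set best matches `word`."""
    source_ids = en.get(word, set())
    return max(fr, key=lambda target: dice(source_ids, fr[target]))


for word in ["cat", "dog", "sleeps", "eats"]:
    print(word, "->", best_translation(word))
```

In this sketch the only signal is which aligned sentences a word appears in, which is the feature space the abstract refers to. Word order and monolingual context are ignored entirely.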
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
