Mapping Supervised Bilingual Word Embeddings from English to low-resource languages
Sourav Dutta (1) ((1) Saarland University)

TL;DR
This paper maps English and low-resource language word embeddings into a shared vector space using a supervised, dictionary-based method, a step toward tasks such as machine translation when only limited bilingual data is available.
Contribution
It applies a supervised approach to mapping bilingual word embeddings between English and four low-resource languages (Estonian, Slovenian, Slovakian, and Hungarian) and discusses the potential of unsupervised methods.
Findings
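Supervised mapping achieves promising accuracy on bilingual retrieval tasks across several retrieval strategies.
Even a modest amount of proper bilingual data makes tasks such as machine translation approachable for these languages.
The authors suggest that unsupervised approaches trained on monolingual text are a suitable direction.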
Abstract
It is very challenging to work with low-resource languages due to the inadequate availability of data. Using a dictionary to map independently trained word embeddings into a shared vector space has proved to be very useful in learning bilingual embeddings in the past. Here we have tried to map individual embeddings of words in English and their corresponding translated words in low-resource languages like Estonian, Slovenian, Slovakian, and Hungarian. We have used a supervised learning approach. We report accuracy scores through various retrieval strategies which show that it is possible to approach challenging tasks in Natural Language Processing like machine translation for such languages, provided that we have at least some amount of proper bilingual data. We also conclude that we can follow an unsupervised learning path on monolingual text data as that is more suitable for…
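The dictionary-based mapping and retrieval evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which mapping objective or retrieval strategies the paper uses, so the orthogonal Procrustes solution and cosine nearest-neighbour retrieval below are assumptions chosen because they are common in this line of work. All function names are hypothetical, and the embeddings are synthetic toy data rather than real English or Estonian vectors.

```python
import numpy as np

def learn_orthogonal_map(src, tgt):
    """Orthogonal Procrustes: the orthogonal W minimising ||src @ W - tgt||_F.

    src and tgt hold one dictionary pair per row (assumed setup, not
    necessarily the paper's exact objective).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def normalize_rows(m):
    """Scale every row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def retrieve_nearest(mapped_src, tgt):
    """For each mapped source vector, return the index of the most similar target."""
    sims = normalize_rows(mapped_src) @ normalize_rows(tgt).T
    return sims.argmax(axis=1)

def precision_at_1(pred, gold):
    """Fraction of source words whose top-ranked candidate is the gold translation."""
    return float(np.mean(pred == gold))

# Synthetic stand-in for two embedding spaces: the "target" space is a
# rotated, slightly noisy copy of the "source" space.
rng = np.random.default_rng(0)
dim, n_pairs = 20, 200
src_vecs = rng.normal(size=(n_pairs, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_vecs = src_vecs @ true_rotation + 0.01 * rng.normal(size=(n_pairs, dim))

# Learn the map on a "training dictionary", evaluate on held-out pairs.
train, test = slice(0, 150), slice(150, 200)
W = learn_orthogonal_map(src_vecs[train], tgt_vecs[train])
pred = retrieve_nearest(src_vecs[test] @ W, tgt_vecs[test])
accuracy = precision_at_1(pred, np.arange(50))
print(f"precision@1 on held-out pairs: {accuracy:.2f}")
```

With real data, `src_vecs` and `tgt_vecs` would be rows taken from independently trained monolingual embeddings, aligned through a bilingual dictionary. Constraining the map to be orthogonal preserves distances within the source space, which is one reason this form is often preferred over an unconstrained linear map.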
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
