Training Cross-Lingual embeddings for Setswana and Sepedi
Mack Makgatho, Vukosi Marivate, Tshephisho Sefara, Valencia Wagner

TL;DR
This paper develops cross-lingual word embeddings for Setswana and Sepedi using VecMap, aiming to improve NLP capabilities for these underrepresented African languages by leveraging monolingual data and semantic evaluation.
Contribution
It introduces a novel approach to create cross-lingual embeddings for Setswana and Sepedi, and releases a new semantic similarity dataset for these languages.
Findings
Cross-lingual embeddings improve semantic similarity representation.
The semantic similarity dataset for Setswana and Sepedi is publicly available.
Intrinsic evaluation shows enhanced semantic understanding in the embeddings.
Abstract
African languages still lag behind in the advances of Natural Language Processing techniques, one reason being the lack of representative data. A technique that can transfer information between languages can help mitigate this lack of data. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create cross-lingual embeddings for Setswana-Sepedi in order to perform cross-lingual transfer. Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained monolingual…
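To make "nearby points in n-dimensional space" concrete, the following is a minimal sketch of a nearest-neighbour lookup by cosine similarity in a shared vector space. It is not the paper's code or VecMap's implementation: the three-dimensional vectors are invented for illustration, the `tn:`/`nso:` prefixes (ISO 639 codes for Setswana and Sepedi) are a labeling convention assumed here, and `cosine_similarity` and `nearest` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors standing in for embeddings that have already been
# mapped into one shared space. Real embeddings have hundreds of dimensions.
shared_space = {
    "tn:metsi": [0.9, 0.1, 0.2],    # Setswana, "water"
    "nso:meetse": [0.8, 0.2, 0.1],  # Sepedi, "water"
    "tn:ntlo": [0.1, 0.9, 0.3],     # Setswana, "house"
}


def nearest(word, space):
    """Return the other entry in `space` most similar to `word`."""
    query = space[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in space.items()
        if other != word
    ]
    return max(scored)[1]


if __name__ == "__main__":
    print(nearest("tn:metsi", shared_space))
```

In a shared space of this kind, a word in one language can retrieve its closest neighbour in the other language, which is the property that cross-lingual transfer relies on.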
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
