Training Cross-Lingual embeddings for Setswana and Sepedi

Mack Makgatho; Vukosi Marivate; Tshephisho Sefara; Valencia Wagner

arXiv:2111.06230·cs.CL·March 1, 2022

TL;DR

This paper trains monolingual word vectors for Setswana and Sepedi and uses VecMap to map them into a shared cross-lingual space, aiming to improve NLP capabilities for these underrepresented African languages. The resulting embeddings are assessed with a semantic evaluation.

Contribution

It creates Setswana-Sepedi cross-lingual embeddings with VecMap and releases a new semantic similarity dataset for these languages.

Findings

01

Cross-lingual embeddings improve semantic similarity representation.

02

The semantic similarity dataset for Setswana and Sepedi is publicly available.

03

Intrinsic evaluation shows enhanced semantic understanding in the embeddings.

Abstract

African languages still lag behind in Natural Language Processing advances, one reason being the lack of representative data. A technique that can transfer information between languages can help mitigate this lack of data. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create cross-lingual embeddings for Setswana-Sepedi in order to perform cross-lingual transfer. Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words occur in similar contexts (Harris, 1954). Cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained monolingual…
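The shared-vector-space idea in the abstract can be illustrated with a small sketch. This is not VecMap itself and not the authors' pipeline: it shows only a plain orthogonal (Procrustes) mapping between two embedding matrices, on synthetic vectors, assuming a set of row-aligned word pairs is already available. The function names `learn_orthogonal_map` and `cosine`, the dimensions, and the random data are all hypothetical and chosen for illustration.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimising ||src @ W - tgt||_F (Procrustes).

    src and tgt are (n_pairs, dim) arrays whose rows are assumed to be
    embeddings of translation pairs (row i of src pairs with row i of tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for two monolingual embedding spaces (an assumption:
# real vectors would come from models trained on Setswana and Sepedi text).
rng = np.random.default_rng(0)
dim, n_pairs = 5, 20
tgt_vecs = rng.normal(size=(n_pairs, dim))

# Build the "source" space as a rotated copy of the target space, so a
# perfect orthogonal mapping exists in this toy setting.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ rotation.T

# Learn the mapping and project the source vectors into the target space.
W = learn_orthogonal_map(src_vecs, tgt_vecs)
mapped = src_vecs @ W

# After mapping, each source vector should sit next to its paired target.
first_pair_similarity = cosine(mapped[0], tgt_vecs[0])
```

In this toy setting the mapped source vectors coincide with their targets, so nearest-neighbour lookups across the two spaces become meaningful. Real monolingual spaces are only approximately related by a rotation, which is why practical tools add further steps on top of this basic mapping.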

Code & Models

Repositories

dsfsi/embedding-eval-data
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis