The Challenge of Diacritics in Yoruba Embeddings

Tosin P. Adewumi; Foteini Liwicki; Marcus Liwicki

arXiv:2011.07605 · cs.CL · November 17, 2020

TL;DR

This paper investigates how diacritics affect Yoruba word embeddings, showing that undiacritized data improves performance and providing new evaluation sets for better assessment of embedding quality.

Contribution

It empirically demonstrates improved Yoruba embeddings from undiacritized data and introduces new analogy sets for evaluation.

Findings

01

Undiacritized Yoruba data yields better embedding performance.

02

Compared with two other works, the paper's Yoruba embeddings score best on the WordSim similarity test and the corresponding Spearman correlation.

03

New analogy sets enable more comprehensive evaluation.

Abstract

The major contributions of this work are the empirical establishment of better performance for Yoruba embeddings trained on an undiacritized (normalized) dataset and the provision of new analogy sets for evaluation. Yoruba, being a tonal language, uses diacritics (tonal marks) in its written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset, with the second copy normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance in WordSim and the corresponding Spearman correlation.
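To make the normalization step concrete, here is a minimal Python sketch of one way to undiacritize Yoruba text. It is an illustration only, not the authors' preprocessing code: the function name strip_diacritics is hypothetical, and the sketch assumes that every combining mark is removed, including the underdots in letters such as ẹ, ọ, and ṣ. The paper's exact normalization rules may differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text (hypothetical helper, not from the paper)."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" (nonspacing mark) covers tone marks and underdots.
    # Assumption: all such marks are dropped, not only the tonal ones.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the result is in a canonical form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["Yorùbá", "ọkọ", "ọkọ̀", "ọkọ́"]:
        print(word, "->", strip_diacritics(word))
```

Under this assumption, several distinct diacritized forms collapse to a single undiacritized token, so the normalized corpus has a smaller vocabulary with more occurrences per word. That is one plausible reading of why the two versions of the same Wikipedia dataset produce embeddings of different quality.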

Code & Models

Repositories

tosingithub/ydesk (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems