The Challenge of Diacritics in Yoruba Embeddings
Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

TL;DR
This paper investigates how diacritics affect Yoruba word embeddings, showing that undiacritized data improves performance and providing new evaluation sets for better assessment of embedding quality.
Contribution
It empirically demonstrates improved Yoruba embeddings from undiacritized data and introduces new analogy sets for evaluation.
Findings
Undiacritized Yoruba data yields better embedding performance.
Compared with two other works, the paper's Yoruba embeddings achieve the best WordSim score and corresponding Spearman correlation.
New analogy sets enable more comprehensive evaluation.
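The WordSim finding above rests on a rank correlation between model similarity scores and human ratings. The following is a minimal sketch of that kind of intrinsic evaluation, not the authors' evaluation code. The function names, the toy vectors, and the placeholder word keys are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def wordsim_spearman(vectors, pairs):
    """Correlate model cosine similarities with human scores.

    vectors: dict mapping word -> embedding (hypothetical toy format)
    pairs:   iterable of (word1, word2, human_score)
    Pairs with an out-of-vocabulary word are skipped.
    """
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(score)
    return spearman(model, human)

# Toy data only: placeholder keys and made-up scores, not Yoruba WordSim entries.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 3.0)]
toy_rho = wordsim_spearman(toy_vectors, toy_pairs)
```

Because Spearman correlation depends only on rank order, it rewards embeddings that order word pairs the way human annotators do, regardless of the absolute similarity values.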
Abstract
The major contributions of this work are the empirical demonstration that Yoruba embeddings trained on an undiacritized (normalized) dataset perform better, and the provision of new analogy sets for evaluation. Yoruba, being a tonal language, uses diacritics (tonal marks) in its written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset, with the second copy normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance on WordSim and the corresponding Spearman correlation.
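The abstract does not spell out how the dataset was normalized. One common way to undiacritize text is Unicode decomposition followed by removal of combining marks, sketched below with Python's standard `unicodedata` module. The function name `strip_diacritics` is hypothetical. This version also removes the underdots on ẹ, ọ, and ṣ along with the tone marks, and the abstract does not say whether the authors did the same.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and underdots) from text.

    Assumption: this is one plausible normalization, not necessarily
    the exact procedure used in the paper.
    """
    # NFD splits each accented character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything left so the output is in NFC form.
    return unicodedata.normalize("NFC", stripped)

example = strip_diacritics("Yorùbá")
```

Applying such a function to every line of a corpus before training yields the undiacritized copy of the dataset, while the original copy is left untouched for comparison.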
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
