Improving Yorùbá Diacritic Restoration
Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson, Wuraola Fisayo Oyewusi, Olamilekan Wahab, Kola Tubosun

TL;DR
This paper enhances Yorùbá diacritic restoration by expanding datasets from diverse sources, evaluating models on modern news text, and releasing resources openly to support Yorùbá NLP development.
Contribution
It introduces a significantly enlarged Yorùbá dataset from multiple sources and evaluates diacritic restoration models on contemporary text, with all resources openly available.
Findings
Improved dataset from diverse sources with millions of tokens.
Evaluation of diacritic restoration models on modern journalistic text.
Open-source release of datasets, models, and code.
Abstract
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. These diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational Speech or Natural Language Processing task. However, diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as limited general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general purpose, public-domain Yorùbá evaluation dataset of modern…
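To make the task concrete, the following is a minimal Python sketch of how undiacritized input can be derived from fully diacritized Yorùbá text, which is one common way to build source/target pairs for a restoration model. It is an illustration under stated assumptions, not the authors' pipeline: the function names `strip_diacritics` and `make_pair` are hypothetical, and the sketch relies only on the standard-library `unicodedata` module.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Return (undiacritized source, NFC-normalized diacritized target)."""
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    # Distinct diacritized forms collapse to one undiacritized string,
    # which is the ambiguity a restoration model has to resolve.
    for word in ["ọkọ", "ọkọ̀", "ọkọ́", "oko"]:
        print(word, "->", strip_diacritics(word))
```

Normalizing both sides to NFC keeps the same word from appearing under different Unicode encodings, a common source of noise when aggregating text from many sources.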
