Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text
Iroro Orife

TL;DR
This paper applies attentive sequence-to-sequence neural models to restoring diacritics in Yorùbá text, reaching diacritization error rates below 5% on its evaluation dataset and supporting speech and NLP tasks for this under-resourced language.
Contribution
It reframes diacritic restoration as a machine translation problem and releases open-source pre-trained models, datasets and source code for Yorùbá diacritization.
Findings
Diacritization error rate below 5% on evaluation dataset
Open-source models and datasets released for Yorùbá language technology
Effective neural approach for diacritic restoration in Yorùbá
Abstract
Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts, due to limited device and application support. Diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any Yorùbá text-to-speech (TTS), automatic speech recognition (ASR) and natural language processing (NLP) tasks. Reframing Automatic Diacritic Restoration (ADR) as a machine translation task, we experiment with two different attentive Sequence-to-Sequence neural models to process undiacritized text. On our evaluation dataset, this approach produces diacritization error rates of less than 5%. We have released pre-trained models, datasets and source code as an open-source project to advance efforts on Yorùbá language technology.
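The machine translation framing in the abstract implies a parallel corpus whose source side is undiacritized text and whose target side is the same text with diacritics. One plausible way to build such pairs from diacritized sentences is sketched below. This is an illustration only, not the paper's released preprocessing code: the helper names `strip_diacritics` and `make_parallel_pairs` are hypothetical, and removing every Unicode combining mark is an assumed stand-in for whatever normalization the authors actually applied.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (e.g. tone marks and under-dots) from text.

    Assumption: dropping every Unicode combining mark approximates the
    "undiacritized" input described in the abstract.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters, non-zero for marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_only)


def make_parallel_pairs(diacritized_sentences):
    """Build (source, target) pairs for a translation-style model.

    source = undiacritized sentence, target = NFC-normalized diacritized sentence.
    """
    pairs = []
    for sentence in diacritized_sentences:
        target = unicodedata.normalize("NFC", sentence)
        pairs.append((strip_diacritics(target), target))
    return pairs
```

For instance, `strip_diacritics("Yorùbá")` yields `"Yoruba"`, so the sentence pair fed to a sequence-to-sequence model would map the plain form back to the diacritized one.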
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
