Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

Iroro Orife

arXiv:1804.00832·cs.CL·October 31, 2018

TL;DR

This paper applies attentive sequence-to-sequence neural models to restore diacritics in Yorùbá text, reporting diacritization error rates below 5% on its evaluation dataset and supporting NLP tasks for this under-resourced language.

Contribution

It reframes automatic diacritic restoration as a machine translation problem and releases pre-trained models, datasets and source code for Yorùbá diacritization as an open-source project.

Findings

01. Diacritization error rates below 5% on the evaluation dataset

02. Open-source pre-trained models, datasets and source code released for Yorùbá language technology

03. Attentive sequence-to-sequence models are an effective approach to diacritic restoration in Yorùbá

Abstract

Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts due to limited device and application support. Diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any Yorùbá text-to-speech (TTS), automatic speech recognition (ASR) and natural language processing (NLP) task. Reframing Automatic Diacritic Restoration (ADR) as a machine translation task, we experiment with two different attentive sequence-to-sequence neural models to process undiacritized text. On our evaluation dataset, this approach produces diacritization error rates of less than 5%. We have released pre-trained models, datasets and source code as an open-source project to advance efforts on Yorùbá language technology.
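
To make the machine translation framing concrete, the sketch below shows one way a parallel training pair could be built: the diacritized sentence serves as the target, and a source is derived from it by stripping combining marks. This is an illustrative assumption, not the preprocessing used in the released project; the function names `strip_diacritics` and `make_parallel_pair` are hypothetical, and only Python's standard `unicodedata` module is used.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text.

    Hypothetical helper: decomposes each character (NFD), drops every
    code point with a non-zero combining class, then recomposes (NFC).
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_parallel_pair(diacritized: str) -> Tuple[str, str]:
    """Return (source, target) for a translation-style ADR model.

    The target is the NFC-normalized diacritized text; the source is
    the same text with its diacritics removed.
    """
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_parallel_pair("Yorùbá")
    print(source)  # Yoruba
    print(target)  # Yorùbá
```

A sequence-to-sequence model trained on such pairs would learn to map the undiacritized source back to the diacritized target. Normalizing both sides to a single Unicode form is a design choice that keeps visually identical strings from being treated as different tokens.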

Code & Models

Repositories

Niger-Volta-LTI/yoruba-adr (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems