The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation

David I. Adelani; Dana Ruiter; Jesujoba O. Alabi; Damilola Adebonojo; Adesina Ayeni; Mofe Adeyemi; Ayodele Awokoya; Cristina España-Bonet

arXiv:2103.08647·cs.CL·August 17, 2021·AfricaNLP·6 cites

Open Access 1 Repo 9 Models 2 Datasets

TL;DR

This paper introduces MENYO-20k, a standardized Yorùbá-English dataset, and benchmarks neural machine translation models on it, analyzing the impact of diacritics on translation quality and outperforming existing massively multilingual models.

Contribution

The creation of the MENYO-20k dataset and comprehensive benchmarks for Yorùbá-English translation, including an analysis of the effect of diacritics on translation quality.

Findings

01

The paper's models outperform Google and Facebook M2M when translating into Yorùbá.

02

Diacritics significantly influence translation quality and intelligibility.

03

Benchmark results set a new standard for Yorùbá-English neural machine translation.

Abstract

Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to a lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the…
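
To make the diacritics question concrete, the sketch below shows one common way to produce an undiacritized copy of Yorùbá text for comparison experiments: decompose with Unicode NFD and drop combining marks. This is an illustrative assumption, not the authors' preprocessing; the function name `strip_diacritics` is hypothetical and does not come from the paper or its repository.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0 (non-marks).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Yorùbá"))  # -> Yoruba
    # Spellings that differ only in tone marks or under-dots collapse together.
    variants = ["ọkọ", "ọkọ́", "ọkọ̀", "oko"]
    print({strip_diacritics(v) for v in variants})  # -> {'oko'}
```

Because several distinct spellings map to the same undiacritized string, stripping marks discards information that a translation model could otherwise use, which is one plausible reason diacritics would matter for quality.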

Code & Models

Repositories

uds-lsv/menyo-20k_mt
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications