The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation
David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

TL;DR
This paper introduces MENYO-20k, a standardized Yorùbá-English dataset, and benchmarks neural machine translation models on it. It analyzes how diacritics affect translation quality and reports models that outperform existing multilingual models.
Contribution
Creation of the MENYO-20k dataset and comprehensive benchmarks for Yorùbá-English translation, including an analysis of how diacritics affect translation quality.
Findings
The paper's models outperform Google and Facebook M2M on Yorùbá translation.
Diacritics significantly influence translation quality and intelligibility.
Benchmark results set a new standard for Yorùbá-English neural machine translation.
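To make the diacritics finding concrete, here is a minimal sketch of how an undiacritized variant of Yorùbá text could be produced for comparison. It assumes that "removing diacritics" means dropping Unicode combining marks after canonical decomposition; the function name `strip_diacritics` and the `keep_underdot` option are hypothetical and are not taken from the paper's preprocessing.

```python
import unicodedata

# U+0323 COMBINING DOT BELOW: distinguishes the letters e/ẹ, o/ọ, s/ṣ in Yorùbá.
UNDERDOT = "\u0323"


def strip_diacritics(text: str, keep_underdot: bool = False) -> str:
    """Remove combining marks (tone marks, optionally underdots) from text.

    Hypothetical helper for illustration only; not the authors' pipeline.
    """
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        is_mark = unicodedata.combining(ch) != 0
        if is_mark and not (keep_underdot and ch == UNDERDOT):
            continue  # drop the combining mark
        kept.append(ch)
    # NFC recomposes any marks that were kept (for example, the underdot).
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    print(strip_diacritics("Yorùbá"))
    print(strip_diacritics("ọjọ́", keep_underdot=True))
```

Dropping every mark also merges letters that differ only by an underdot, so the `keep_underdot` switch lets one remove tone marks alone. Which of the two variants matches a given experimental setup is an assumption to verify against the paper.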
Abstract
Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the…
Code & Models
- 🤗 Davlan/byt5-base-eng-yor-mt (model) · 5 dl · ♡ 2
- 🤗 Davlan/byt5-base-yor-eng-mt (model) · 1 dl · ♡ 2
- 🤗 Davlan/m2m100_418M-eng-yor-mt (model) · 14 dl · ♡ 1
- 🤗 Davlan/m2m100_418M-yor-eng-mt (model) · 4 dl
- 🤗 Davlan/mT5_base_yoruba_adr (model) · 145 dl
- 🤗 Davlan/mbart50-large-eng-yor-mt (model)
- 🤗 Davlan/mbart50-large-yor-eng-mt (model) · 2 dl
- 🤗 Davlan/mt5_base_eng_yor_mt (model) · 7 dl
- 🤗 Davlan/mt5_base_yor_eng_mt (model) · 3 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
