AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
Machel Reid, Junjie Hu, Graham Neubig, Yutaka Matsuo

TL;DR
This paper introduces AfroMT, a standardized benchmark and analysis toolkit for translating eight African languages, and proposes novel pretraining strategies that significantly improve translation quality in low-resource settings.
Contribution
The paper creates a standardized, reproducible benchmark for eight widely spoken African languages and develops new data augmentation-based pretraining methods tailored to low-resource multilingual translation.
Findings
Pretraining on 11 languages improves BLEU by up to 2 points over strong baselines.
Data augmentation strategies yield gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios.
All code and models will be publicly released.
Abstract
Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
