Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang

TL;DR
This paper reports on developing practical machine translation systems for over a thousand languages, focusing on dataset creation, model training, and evaluation challenges for underrepresented languages.
Contribution
It introduces methods for dataset collection, multilingual model training, and analysis of evaluation metrics for a vast number of languages, advancing inclusive machine translation.
Findings
Successful creation of datasets for 1500+ languages
Development of multilingual models trained on supervised parallel data for over 100 high-resource languages and monolingual data for an additional 1000+ languages
Identification of limitations in current evaluation metrics and common error modes
Abstract
In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to…
Videos
Machine Translation for a 1000 languages – Paper explained · YouTube
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
