Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur

TL;DR
This paper introduces the largest dataset and models to date for Creole language machine translation, covering 41 languages and 172 translation directions and advancing MT for low-resource languages.
Contribution
The paper presents the largest Creole MT dataset to date, with 14.5M unique Creole sentences paired with translations (11.6M of them released publicly), and provides models that support all 41 languages and outperform genre-specific models on 26 of 34 translation directions.
Findings
Outperforms genre-specific models on 26 of 34 directions
Largest dataset for Creole MT with 14.5 million sentences
Supports 41 languages in 172 translation directions
Abstract
A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
