Omnilingual MT: Machine Translation for 1,600 Languages
Omnilingual MT Team: Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Xiang "Tony" Cao, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler

TL;DR
This paper introduces Omnilingual Machine Translation (OMT), an MT system that supports more than 1,600 languages, enabled by a comprehensive data strategy and large language models specialized for translation.
Contribution
The paper presents the first MT system supporting more than 1,600 languages, utilizing a novel data integration approach and model specialization techniques.
Findings
OMT models match or exceed the performance of a 70B LLM baseline at smaller model sizes.
OMT models significantly expand the set of languages with meaningful translation.
Evaluation datasets and leaderboard are publicly available and evolving.
Abstract
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundred more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an…
