Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
Alex Jones, Isaac Caswell, Ishank Saxena, Orhan Firat

TL;DR
This paper shows that lexical data augmentation with bilingual lexica yields sizable gains for unsupervised multilingual machine translation, especially when the lexica are carefully curated, and introduces a new multilingual lexicon for low-resource languages.
Contribution
It introduces a practical approach to enhance unsupervised multilingual translation with lexical data augmentation and provides a new high-quality multilingual lexicon for low-resource languages.
Findings
Lexical data augmentation yields sizable translation improvements.
Different augmentation methods provide similar gains and can be combined.
Curated lexica outperform noisier, larger lexica, especially for bigger models.
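To make the idea of lexical data augmentation concrete, here is a minimal sketch of one common form: word-level substitution (often called codeswitching), where source tokens are randomly replaced by their translations from a bilingual lexicon. This summary does not specify which augmentation families the paper uses, so treat this as an illustration only. The function name `codeswitch`, the `swap_prob` parameter, and the toy English-Spanish lexicon are all hypothetical.

```python
import random


def codeswitch(tokens, lexicon, swap_prob=0.5, rng=None):
    """Replace tokens with a lexicon translation with probability swap_prob.

    Illustrative sketch only, not the paper's implementation.
    `lexicon` maps a lowercased source word to a list of target-language
    translations; tokens without an entry are left unchanged.
    """
    if rng is None:
        rng = random.Random()
    augmented = []
    for tok in tokens:
        translations = lexicon.get(tok.lower())
        if translations and rng.random() < swap_prob:
            # Pick one of the available translations at random.
            augmented.append(rng.choice(translations))
        else:
            augmented.append(tok)
    return augmented


# Toy lexicon (hypothetical data for illustration).
toy_lexicon = {"dog": ["perro"], "house": ["casa"]}

# With swap_prob=1.0 every token that has a lexicon entry is replaced.
always = codeswitch(["the", "dog", "house"], toy_lexicon, swap_prob=1.0)

# With swap_prob=0.0 the sentence is returned unchanged.
never = codeswitch(["the", "dog", "house"], toy_lexicon, swap_prob=0.0)
```

In practice one would apply such a function to monolingual training sentences with an intermediate `swap_prob`, so the model sees both original and substituted tokens in the same contexts.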
Abstract
Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including aspects of translation that for a human are the easiest, such as correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Test
