Unsupervised Statistical Machine Translation
Mikel Artetxe, Gorka Labaka, Eneko Agirre

TL;DR
This paper introduces an unsupervised phrase-based SMT approach that leverages monolingual data and cross-lingual embeddings, significantly narrowing the performance gap with supervised translation systems.
Contribution
It presents a novel unsupervised SMT method that induces phrase tables from monolingual data and improves translation quality through iterative backtranslation.
Findings
Achieved 14.08 BLEU on English-German translation
Achieved 26.22 BLEU on English-French translation
Reduced gap with supervised SMT to 2-5 BLEU points
Abstract
While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further,…
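The phrase-table induction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes source and target embeddings have already been mapped into a shared cross-lingual space, and it scores each candidate translation with a softmax over cosine similarities. The scoring function, the `top_k` cutoff, the `temperature` value, the function names, and the toy vectors are all illustrative assumptions, not details taken from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_phrase_table(src_emb, tgt_emb, top_k=2, temperature=0.1):
    """Build a toy phrase table from embeddings in a shared space.

    For each source entry, keep the top_k nearest target entries and turn
    their cosine similarities into probabilities with a softmax.
    Both top_k and temperature are hypothetical settings for this sketch.
    """
    table = {}
    for src, u in src_emb.items():
        sims = {tgt: cosine(u, v) for tgt, v in tgt_emb.items()}
        best = sorted(sims, key=sims.get, reverse=True)[:top_k]
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(sims[t] for t in best)
        weights = {t: math.exp((sims[t] - peak) / temperature) for t in best}
        total = sum(weights.values())
        table[src] = {t: w / total for t, w in weights.items()}
    return table


# Toy English and French vectors, assumed to be already aligned.
src_emb = {"house": [1.0, 0.1], "green": [0.1, 1.0]}
tgt_emb = {"maison": [0.9, 0.2], "vert": [0.2, 0.9], "chien": [-1.0, 0.0]}

table = induce_phrase_table(src_emb, tgt_emb)
for src, candidates in table.items():
    print(src, candidates)
```

In a full system, scores of this kind would be combined with an n-gram language model inside the SMT decoder; this sketch covers only the scoring of candidate translations.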
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
