Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings
Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith

TL;DR
This paper introduces an unsupervised approach using multilingual BERT and self-training to mine parallel sentences from unaligned text, significantly improving machine translation quality across multiple language pairs.
Contribution
It presents a novel unsupervised method for bitext mining and translation that leverages self-trained contextual embeddings, outperforming previous methods on standard benchmarks.
Findings
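Up to a 24.5-point absolute increase in F1 over previous unsupervised methods on the BUCC 2017 bitext mining task.
Up to a 3.5 BLEU improvement in unsupervised translation on the WMT'14 French-English and WMT'16 German-English tasks.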
1.2 BLEU improvement on low-resource IWSLT'15 English-Vietnamese translation.
Abstract
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs,…
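The nearest-neighbor search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mine_pairs`, the mutual-nearest-neighbor filter, the cosine `threshold`, and the toy embeddings are all assumptions made for illustration. In practice the embeddings would come from a sentence encoder such as multilingual BERT, and the self-training step described in the abstract is not shown.

```python
import numpy as np


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_index, tgt_index, cosine) for mutual nearest neighbors.

    Hypothetical helper: a pair is kept only if each sentence is the other's
    nearest neighbor and their cosine similarity reaches `threshold`.
    """
    src_emb = np.asarray(src_emb, dtype=float)
    tgt_emb = np.asarray(tgt_emb, dtype=float)

    # L2-normalize rows so that a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    sim = src @ tgt.T           # shape: (num_src, num_tgt)
    fwd = sim.argmax(axis=1)    # best target for each source sentence
    bwd = sim.argmax(axis=0)    # best source for each target sentence

    pairs = []
    for i, j in enumerate(fwd):
        if bwd[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
toy_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_tgt = [[0.0, 2.0], [3.0, 0.0]]
toy_pairs = mine_pairs(toy_src, toy_tgt)
```

The brute-force similarity matrix is only practical for small inputs; mining from a Wikipedia-scale corpus would need an approximate nearest-neighbor index instead.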
Taxonomy
Methods: Linear Layer · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Weight Decay
