Fully Synthetic Data Improves Neural Machine Translation with Knowledge Distillation

Alham Fikri Aji; Kenneth Heafield

arXiv:2012.15455·cs.CL·September 16, 2021

TL;DR

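Fully synthetic data, produced by round-trip translating target-language monolingual text, improves neural machine translation trained with knowledge distillation. Combining source and target monolingual data outperforms a corpus twice as large drawn from only one language, and which kind of augmentation helps depends on the language in which the test set was originally written.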

Contribution

The paper proposes building a fully synthetic corpus by round-trip translating target-language monolingual text, and using it as training data for knowledge distillation in neural machine translation.

Findings

01

Combining source and target monolingual data outperforms a corpus twice as large drawn from only one language.

02

The effectiveness of each data augmentation depends on the language in which the test set was originally written.

03

Round-trip translation is the most effective way found to incorporate target-language monolingual text.

Abstract

This paper explores augmenting monolingual data for knowledge distillation in neural machine translation. Source language monolingual text can be incorporated as a forward translation. Interestingly, we find the best way to incorporate target language monolingual text is to translate it to the source language and round-trip translate it back to the target language, resulting in a fully synthetic corpus. We find that combining monolingual data from both source and target languages yields better performance than a corpus twice as large only in one language. Moreover, experiments reveal that the improvement depends upon the provenance of the test set. If the test set was originally in the source language (with the target side written by translators), then forward translating source monolingual data matters. If the test set was originally in the target language (with the source written by…
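The two augmentation routes described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: `teacher_fwd` (source to target) and `teacher_bwd` (target to source) are hypothetical stand-ins for translation models, and the assumption that a separate backward model produces the synthetic source side is ours, since the abstract does not say which system does that step.

```python
from typing import Callable, List, Tuple

# A translator is modelled as any function from a sentence to a sentence.
# Both translators below are hypothetical placeholders, not a real NMT API.
Translator = Callable[[str], str]
Pair = Tuple[str, str]  # (source sentence, target sentence)


def forward_translate(src_mono: List[str], teacher_fwd: Translator) -> List[Pair]:
    """Source-language monolingual text: real source, synthetic target."""
    return [(src, teacher_fwd(src)) for src in src_mono]


def round_trip_translate(
    tgt_mono: List[str], teacher_bwd: Translator, teacher_fwd: Translator
) -> List[Pair]:
    """Target-language monolingual text: both sides end up synthetic."""
    pairs = []
    for tgt in tgt_mono:
        synthetic_src = teacher_bwd(tgt)            # target -> source
        synthetic_tgt = teacher_fwd(synthetic_src)  # source -> target again
        # The original target sentence is discarded; only synthetic text is kept.
        pairs.append((synthetic_src, synthetic_tgt))
    return pairs


def build_distillation_corpus(
    src_mono: List[str],
    tgt_mono: List[str],
    teacher_fwd: Translator,
    teacher_bwd: Translator,
) -> List[Pair]:
    """Combine both routes into one training corpus for the student model."""
    return forward_translate(src_mono, teacher_fwd) + round_trip_translate(
        tgt_mono, teacher_bwd, teacher_fwd
    )


if __name__ == "__main__":
    # Toy stand-ins: upper-casing plays "source -> target", lower-casing the reverse.
    corpus = build_distillation_corpus(["hello"], ["WELT"], str.upper, str.lower)
    for src, tgt in corpus:
        print(src, "->", tgt)
```

In a real setup the toy functions would be replaced by calls into actual translation systems, and the student would then be trained on the returned pairs.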

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics

Methods: Knowledge Distillation