Synthetic Pre-Training Tasks for Neural Machine Translation
Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris

TL;DR
This paper investigates the effectiveness of pre-training neural machine translation models using synthetic data and tasks, aiming to reduce reliance on real-world data and mitigate associated ethical concerns.
Contribution
It introduces novel synthetic pre-training approaches involving obfuscated, concatenated, and purely synthetic data, providing empirical insights into their impact on NMT performance.
Findings
Pre-training benefits persist even with highly obfuscated data.
Synthetic parallel data can effectively replace real corpora.
The paper offers insights into the factors that influence the effectiveness of synthetic data for NMT.
Abstract
Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation. We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language…
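To make the first approach in the abstract more concrete, the following is a minimal sketch of corpus obfuscation. It assumes that obfuscation means consistently replacing a chosen fraction of the source and target vocabulary with nonsense tokens; the paper's exact procedure may differ. The function names, the `ratio` parameter, and the `<src_obf…>` / `<tgt_obf…>` token format are hypothetical and used only for illustration.

```python
import random


def build_obfuscation_map(vocab, ratio, prefix, seed=0):
    """Map a random `ratio` fraction of `vocab` to nonsense tokens.

    Hypothetical helper: the token format `<prefixN>` is an assumption.
    """
    rng = random.Random(seed)
    words = sorted(vocab)  # sort so the result is reproducible for a given seed
    k = round(ratio * len(words))
    chosen = rng.sample(words, k)
    return {w: f"<{prefix}{i}>" for i, w in enumerate(chosen)}


def obfuscate(sentence, mapping):
    """Replace every whitespace-separated token that has a mapping entry."""
    return " ".join(mapping.get(tok, tok) for tok in sentence.split())


def obfuscate_corpus(pairs, ratio, seed=0):
    """Obfuscate both sides of a list of (source, target) sentence pairs.

    Each side gets its own mapping, so a given word is always replaced by
    the same nonsense token and sentence lengths are preserved.
    """
    src_vocab = {tok for src, _ in pairs for tok in src.split()}
    tgt_vocab = {tok for _, tgt in pairs for tok in tgt.split()}
    src_map = build_obfuscation_map(src_vocab, ratio, "src_obf", seed)
    tgt_map = build_obfuscation_map(tgt_vocab, ratio, "tgt_obf", seed + 1)
    return [(obfuscate(s, src_map), obfuscate(t, tgt_map)) for s, t in pairs]


if __name__ == "__main__":
    toy_pairs = [
        ("das haus ist klein", "the house is small"),
        ("das haus ist alt", "the house is old"),
    ]
    for src, tgt in obfuscate_corpus(toy_pairs, ratio=0.5):
        print(src, "|||", tgt)
```

With `ratio=0.0` the corpus is returned unchanged, and with `ratio=1.0` every token is replaced, so under these assumptions the parameter controls how much real lexical content remains in the pre-training data.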
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
