Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation
Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, Michael Lyu

TL;DR
This paper investigates the effects of Seq2Seq pretraining in neural machine translation, revealing its benefits and limitations, and proposes strategies to enhance translation quality and robustness.
Contribution
It provides a detailed analysis of Seq2Seq pretraining impacts and introduces in-domain pretraining and input adaptation methods to address identified issues.
Findings
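Seq2Seq pretraining helps NMT models produce more diverse translations and reduces adequacy-related translation errors.
Discrepancies between Seq2Seq pretraining and NMT fine-tuning limit translation quality (domain discrepancy) and induce over-estimation (objective discrepancy).
The proposed in-domain pretraining and input adaptation strategies improve translation performance and robustness.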
Abstract
In this paper, we present a substantial step in better understanding the SOTA sequence-to-sequence (Seq2Seq) pretraining for neural machine translation (NMT). We focus on studying the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT. By carefully designing experiments on three language pairs, we find that Seq2Seq pretraining is a double-edged sword: On one hand, it helps NMT models to produce more diverse translations and reduce adequacy-related translation errors. On the other hand, the discrepancies between Seq2Seq pretraining and NMT finetuning limit the translation quality (i.e., domain discrepancy) and induce the over-estimation issue (i.e., objective discrepancy). Based on these observations, we further propose simple and effective strategies, named in-domain pretraining and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
