Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation
Yang Feng, Shuhao Gu, Dengji Guo, Zhengxin Yang, Chenze Shao

TL;DR
This paper proposes a novel training method for neural machine translation that incorporates future information via a seer decoder and uses knowledge distillation to improve the conventional decoder's performance, especially on larger datasets.
Contribution
Introduces a seer decoder to incorporate future context and uses knowledge distillation to enhance the standard decoder in neural machine translation.
Findings
Significant improvements over competitive baselines on Chinese-English, English-German, and English-Romanian translation tasks.
Greater gains observed on larger datasets.
Knowledge distillation outperforms adversarial learning and L2 regularization for transferring knowledge.
Abstract
Although teacher forcing has become the main training paradigm for neural machine translation, it usually conditions predictions only on past information and hence lacks global planning for the future. To address this problem, we introduce another decoder, called the seer decoder, into the encoder-decoder framework during training; it incorporates future information into target predictions. Meanwhile, we force the conventional decoder to simulate the behavior of the seer decoder via knowledge distillation. In this way, at test time the conventional decoder can perform like the seer decoder without the seer decoder being present. Experimental results on the Chinese-English, English-German and English-Romanian translation tasks show that our method outperforms competitive baselines significantly and achieves greater improvements on the bigger data sets. Besides, the experiments also prove knowledge…
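As a rough illustration of the distillation step described in the abstract, the sketch below computes a per-token loss in which the conventional decoder is trained both on the gold target word and on the seer decoder's output distribution. It is not the authors' implementation: the function names, the interpolation weight kd_weight, and the toy logits are all assumptions made for this example, and the paper's actual objective may be combined differently.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_distillation_loss(student_logits, seer_logits, gold_index, kd_weight=0.5):
    """Per-token loss for the conventional (student) decoder.

    Assumed form, for illustration only:
        (1 - kd_weight) * NLL(gold word) + kd_weight * CE(seer distribution, student)
    The seer decoder's distribution is treated as a fixed soft target.
    """
    student_logp = log_softmax(student_logits)
    seer_p = [math.exp(lp) for lp in log_softmax(seer_logits)]

    # Standard teacher-forcing term: negative log-likelihood of the gold word.
    nll = -student_logp[gold_index]

    # Distillation term: cross-entropy between seer and student distributions.
    kd = -sum(q * lp for q, lp in zip(seer_p, student_logp))

    return (1.0 - kd_weight) * nll + kd_weight * kd


# Toy vocabulary of three words; the seer decoder prefers word 0.
seer = [2.0, 0.5, -1.0]
loss_close = token_distillation_loss([2.0, 0.5, -1.0], seer, gold_index=0, kd_weight=1.0)
loss_far = token_distillation_loss([-1.0, 0.5, 2.0], seer, gold_index=0, kd_weight=1.0)
print(loss_close < loss_far)
```

With kd_weight set to 1.0 the loss depends only on how closely the student matches the seer distribution, so a student whose logits agree with the seer's receives the lower loss. At test time only the student's logits would be used, which matches the abstract's point that the seer decoder is needed only during training.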
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
