Ensemble Distillation for Neural Machine Translation
Markus Freitag, Yaser Al-Onaizan, Baskaran Sankaran

TL;DR
This paper introduces a method for distilling ensemble and oracle BLEU teacher networks into a single neural machine translation model, improving translation quality and training efficiency without code changes.
Contribution
It presents a knowledge distillation approach for NMT that transfers the translation quality of ensemble and oracle BLEU teachers into a single student model, together with a teacher-based data filtering technique that speeds up training and improves quality.
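As an illustration of the oracle BLEU teacher, the following is a minimal sketch, assuming "oracle BLEU" has its usual meaning: from an n-best list, pick the hypothesis that scores highest against the reference. The summary does not specify the scoring details, so `sentence_bleu` below is a simplified, add-one-smoothed sentence-level BLEU written for this example, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with add-one smoothing (an assumption,
    not necessarily the scoring used in the paper)."""
    h, r = hyp.split(), ref.split()
    if not h:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        h_counts, r_counts = ngrams(h, n), ngrams(r, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((h_counts & r_counts).values())
        total = max(len(h) - n + 1, 0)
        log_precision += math.log((overlap + 1) / (total + 1))
    # Brevity penalty discourages hypotheses shorter than the reference.
    brevity = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return brevity * math.exp(log_precision / max_n)


def oracle_hypothesis(nbest: List[str], ref: str) -> str:
    """Return the n-best entry with the highest sentence-level BLEU."""
    return max(nbest, key=lambda hyp: sentence_bleu(hyp, ref))


# Toy n-best list; in practice this would come from the teacher's beam search.
nbest = ["a cat sat on a mat", "the cat sat on the mat", "cat mat"]
best = oracle_hypothesis(nbest, "the cat sat on the mat")
```

Under this reading, the selected hypotheses would serve as the training targets for the student. A production setup would more likely rely on an established BLEU implementation than on this hand-rolled score.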
Findings
Ensemble distillation improves translation quality.
Data filtering accelerates training and boosts performance.
Method is architecture-agnostic and easy to implement.
Abstract
Knowledge distillation describes a method for training a student network to perform better by learning from a stronger teacher network. Translating a sentence with a Neural Machine Translation (NMT) engine is time-consuming, and having a smaller model speeds up this process. We demonstrate how to transfer the translation quality of an ensemble and an oracle BLEU teacher network into a single NMT system. Further, we present translation improvements from a teacher network that has the same architecture and dimensions as the student network. As the training of the student model is still expensive, we introduce a data filtering method based on the knowledge of the teacher model that not only speeds up the training, but also leads to better translation quality. Our techniques need no code change and can be easily reproduced with any NMT architecture to speed up the decoding process.
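The abstract does not spell out the filtering criterion, so the following is only a minimal sketch under stated assumptions. It assumes the teacher has already translated every source sentence, that the student is trained on (source, teacher translation) pairs, and that a pair is dropped when the teacher output is far from the reference. Distance is measured here as word-level edit distance divided by reference length, a rough stand-in for a TER-style score. The function names, the `max_error_rate` threshold of 0.8, and the toy data are illustrative rather than taken from this summary.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(hyp: str, ref: str) -> float:
    """Edit distance normalized by reference length (TER-like, no shifts)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else float("inf")
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def build_student_data(
    sources: List[str],
    teacher_outputs: List[str],
    references: List[str],
    max_error_rate: float = 0.8,  # assumed threshold, tune on held-out data
) -> List[Tuple[str, str]]:
    """Keep (source, teacher translation) pairs whose teacher output stays
    close to the reference; the rest are filtered out before training."""
    kept = []
    for src, hyp, ref in zip(sources, teacher_outputs, references):
        if error_rate(hyp, ref) <= max_error_rate:
            kept.append((src, hyp))
    return kept


# Toy example: the second teacher output disagrees strongly with its reference.
student_data = build_student_data(
    sources=["die Katze sass", "guten Morgen"],
    teacher_outputs=["the cat sat", "completely wrong output here"],
    references=["the cat sat", "good morning"],
)
```

Because the filter only touches the training data, a step like this would fit the abstract's claim that no change to the NMT code is needed: the student is trained with the usual pipeline on a smaller corpus.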
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
