Sequence-Level Knowledge Distillation

Yoon Kim; Alexander M. Rush

arXiv:1606.07947 · cs.CL · September 23, 2016 · 117 cites

TL;DR

This paper introduces sequence-level knowledge distillation for neural machine translation, producing smaller, faster student models that stay close to the teacher in translation quality and outperform baseline models trained without distillation.

Contribution

It proposes novel sequence-level knowledge distillation methods for NMT, reducing model size and inference time while preserving translation quality.

Findings

01. The best student model runs 10 times faster than its teacher model.

02. Knowledge distillation improves BLEU scores over baseline models trained without distillation.

03. Pruned student models have 13 times fewer parameters than the teacher, with minimal BLEU loss.

Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in…
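The difference between the word-level and sequence-level objectives mentioned in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the usual formulation in which word-level distillation minimizes the cross-entropy between teacher and student next-word distributions, and one sequence-level variant trains the student on the teacher's own decoded output. The function names, the three-word vocabulary, and the probabilities are all hypothetical, and a per-position argmax stands in for beam search.

```python
import math

def word_level_kd_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher and student next-word distributions,
    summed over target positions (assumed word-level distillation loss)."""
    loss = 0.0
    for q_t, p_t in zip(teacher_probs, student_probs):
        for q, p in zip(q_t, p_t):
            if q > 0.0:
                loss -= q * math.log(p)
    return loss

def sequence_level_kd_loss(teacher_output, student_probs):
    """Negative log-likelihood, under the student, of the single sequence
    decoded from the teacher (assumed sequence-level distillation loss)."""
    return -sum(math.log(p_t[tok]) for tok, p_t in zip(teacher_output, student_probs))

# Hypothetical toy data: 2 target positions, vocabulary of 3 words.
teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]

# Stand-in for beam search on the teacher: pick the argmax word at each position.
teacher_output = [max(range(len(q_t)), key=q_t.__getitem__) for q_t in teacher_probs]

word_loss = word_level_kd_loss(teacher_probs, student_probs)
seq_loss = sequence_level_kd_loss(teacher_output, student_probs)
print(f"word-level KD loss:     {word_loss:.4f}")
print(f"sequence-level KD loss: {seq_loss:.4f}")
```

In this sketch the word-level loss uses the teacher's full distribution at every position, while the sequence-level loss only needs the teacher's decoded sentence, so the student can be trained on teacher-generated translations like ordinary data. The second sequence-level variant mentioned in the abstract is not covered here.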

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Knowledge Distillation