Understanding Knowledge Distillation in Non-autoregressive Machine Translation

Chunting Zhou; Graham Neubig; Jiatao Gu

arXiv:1911.02727 · cs.CL · February 24, 2021 · 33 cites

Open Access

TL;DR

This paper investigates why knowledge distillation substantially improves non-autoregressive machine translation (NAT). It finds that distillation reduces the complexity of the training data and helps NAT models capture the variations in the output data, and it reports state-of-the-art NAT results.

Contribution

It systematically analyzes the role of knowledge distillation in NAT, linking data complexity to model capacity and translation quality, and proposes methods to optimize data complexity.

Findings

1. Knowledge distillation reduces data complexity for NAT.

2. Optimal data complexity correlates with NAT model capacity.

3. Achieved state-of-the-art NAT performance on WMT14 En-De.

Abstract

Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the…
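
The data-creation step the abstract describes, in which training targets come from a pretrained autoregressive model, can be illustrated with a minimal sketch. This is not the authors' code: `build_distilled_corpus` and `toy_teacher` are hypothetical names, and the word-for-word lookup teacher is only a stand-in for a real pretrained autoregressive translation model.

```python
from typing import Callable, Iterable, List, Tuple


def build_distilled_corpus(
    sources: Iterable[str],
    teacher_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each source sentence with the teacher's translation.

    The teacher output replaces the human reference as the training
    target for the NAT (student) model.
    """
    return [(src, teacher_translate(src)) for src in sources]


def toy_teacher(src: str) -> str:
    """Hypothetical stand-in for a pretrained autoregressive model.

    It is deterministic: the same source always maps to the same target,
    which is one intuition for why distilled targets vary less than
    human references.
    """
    lexicon = {"hello": "hallo", "world": "welt"}
    return " ".join(lexicon.get(tok, tok) for tok in src.lower().split())


corpus = build_distilled_corpus(["Hello world", "hello there"], toy_teacher)
for src, tgt in corpus:
    print(f"{src!r} -> {tgt!r}")
```

In practice the teacher would be a trained translation model decoding each source sentence, and the resulting pairs would be used to train the NAT model in place of the original parallel data.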

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation