Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Chunting Zhou, Graham Neubig, Jiatao Gu

TL;DR
This paper investigates why knowledge distillation significantly improves non-autoregressive machine translation (NAT). It finds that distillation reduces the complexity of the training data and helps NAT models capture the variations in the output data, and it reports state-of-the-art NAT results.
Contribution
The paper systematically analyzes the role of knowledge distillation in NAT, links the complexity of the training data to model capacity and translation quality, and proposes methods for adjusting data complexity.
Findings
Knowledge distillation reduces data complexity for NAT.
Optimal data complexity correlates with NAT model capacity.
Achieved state-of-the-art NAT performance on WMT14 En-De.
Abstract
Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the…
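As a rough illustration of the distillation step the abstract describes (training data created from a pretrained autoregressive model), the sketch below replaces each reference translation with a teacher's output for the same source sentence. Everything in it is an assumption made for illustration: `toy_teacher` is a hypothetical lookup table standing in for a real autoregressive model, the four-pair corpus is invented, and counting distinct targets per source is only a crude stand-in for data complexity, not the measure used in the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def distill_corpus(corpus: List[Pair],
                   teacher_translate: Callable[[str], str]) -> List[Pair]:
    """Replace every reference target with the teacher's output for that source."""
    return [(src, teacher_translate(src)) for src, _ in corpus]


def targets_per_source(corpus: List[Pair]) -> Dict[str, Set[str]]:
    """Collect the distinct targets seen for each source (a crude variation proxy)."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for src, tgt in corpus:
        seen[src].add(tgt)
    return dict(seen)


def toy_teacher(src: str) -> str:
    """Hypothetical stand-in for a pretrained autoregressive model.

    A real teacher would decode (for example with beam search); this one
    simply returns one fixed translation per source sentence.
    """
    lexicon = {"thank you": "danke", "good morning": "guten morgen"}
    return lexicon.get(src, src)


# Invented toy corpus: one source appears with three different references.
raw_corpus: List[Pair] = [
    ("thank you", "danke"),
    ("thank you", "vielen dank"),
    ("thank you", "danke schön"),
    ("good morning", "guten morgen"),
]

distilled = distill_corpus(raw_corpus, toy_teacher)

print(targets_per_source(raw_corpus))
print(targets_per_source(distilled))
```

In this toy setting the distilled corpus keeps the same number of sentence pairs but maps each source to a single target, which is one informal way to picture how distillation can leave less output variation for a parallel decoder to model.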
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation
