Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Ye Lin; Yanyang Li; Ziyang Wang; Bei Li; Quan Du; Tong Xiao; Jingbo Zhu

arXiv:2009.09152·cs.CL·July 20, 2021

Open Access

TL;DR

This paper introduces Weight Distillation, a method that transfers the knowledge in a large network's parameters to a smaller network through a parameter generator, yielding small machine translation models that are faster than the large network while remaining competitive in quality.

Contribution

It proposes weight distillation, a technique for model compression and acceleration that transfers the knowledge in a large network's parameters to a small network via a parameter generator.

Findings

01

Small networks trained with weight distillation are 1.88-2.94x faster than large networks.

02

With a small network of the same size, weight distillation outperforms knowledge distillation by 0.51-1.82 BLEU points.

03

On WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De, the small networks remain competitive with the large network in translation performance despite the speedup.

Abstract

Knowledge distillation has been proven to be effective in model acceleration and compression. It allows a small network to learn to generalize in the same way as a large network. Recent successes in pre-training suggest the effectiveness of transferring model parameters. Inspired by this, we investigate methods of model acceleration and compression in another line of research. We propose Weight Distillation to transfer the knowledge in the large network parameters through a parameter generator. Our experiments on WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88~2.94x faster than the large network but with competitive performance. With the same sized small network, weight distillation can outperform knowledge distillation by 0.51~1.82 BLEU points.
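
The abstract does not spell out what the parameter generator looks like. As a rough illustration only, the sketch below assumes a generator that maps one teacher weight matrix to a smaller student weight matrix with two projection matrices. The names generate_student_weight, left, and right are hypothetical, and the linear form is an assumption rather than the authors' formulation.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def generate_student_weight(teacher_w, left, right):
    """Hypothetical generator: student_w = left^T @ teacher_w @ right.

    teacher_w: (d_teacher x d_teacher) weights of the large network.
    left, right: (d_teacher x d_student) projections; in a real setup these
    would be trained, here they are random placeholders.
    """
    return matmul(matmul(transpose(left), teacher_w), right)


def random_matrix(rows, cols, rng):
    """Random placeholder matrix with entries drawn from a unit Gaussian."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_teacher, d_student = 8, 4  # assumed sizes, chosen only for illustration

teacher_w = random_matrix(d_teacher, d_teacher, rng)
left = random_matrix(d_teacher, d_student, rng)
right = random_matrix(d_teacher, d_student, rng)

student_w = generate_student_weight(teacher_w, left, right)
print(len(student_w), len(student_w[0]))  # student matrix is d_student x d_student
```

The point of the sketch is only that the student's parameters are computed from the teacher's parameters rather than initialized independently; how the generator is actually parameterized and trained is described in the paper itself.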

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Knowledge Distillation