Weight Distillation: Transferring the Knowledge in Neural Network Parameters
Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu

TL;DR
This paper introduces Weight Distillation, a method that transfers the knowledge in a large network's parameters to a smaller network through a parameter generator, yielding small machine translation models that run 1.88-2.94x faster than the large network with competitive performance.
Contribution
It proposes a novel weight distillation technique that enhances model compression and acceleration by transferring parameter knowledge via a generator.
Findings
Small networks trained with weight distillation are 1.88-2.94x faster than large networks.
With a same-sized small network, weight distillation outperforms knowledge distillation by 0.51-1.82 BLEU points.
The small networks keep translation performance competitive with the large network on WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De.
Abstract
Knowledge distillation has been proven to be effective in model acceleration and compression. It allows a small network to learn to generalize in the same way as a large network. Recent successes in pre-training suggest the effectiveness of transferring model parameters. Inspired by this, we investigate methods of model acceleration and compression in another line of research. We propose Weight Distillation to transfer the knowledge in the large network parameters through a parameter generator. Our experiments on WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88~2.94x faster than the large network but with competitive performance. With the same sized small network, weight distillation can outperform knowledge distillation by 0.51~1.82 BLEU points.
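The abstract does not spell out how the parameter generator works, so the following is only a minimal sketch of the general idea of producing a small network's weights from a large network's weights. It assumes a purely linear generator with two projection matrices; the function name `generate_student_weight`, the matrices `left` and `right`, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

def generate_student_weight(teacher_w, left, right):
    """Map a teacher weight matrix to a smaller student weight matrix.

    Assumed (hypothetical) formulation: student_w = left @ teacher_w @ right,
    where `left` shrinks the input dimension and `right` shrinks the output
    dimension. The paper's actual generator may differ.
    """
    return left @ teacher_w @ right

rng = np.random.default_rng(0)
d_in_t, d_out_t = 8, 8   # teacher layer dimensions (illustrative)
d_in_s, d_out_s = 4, 4   # student layer dimensions (illustrative)

# Teacher weights stand in for a trained large network's parameters.
teacher_w = rng.standard_normal((d_in_t, d_out_t))

# Generator parameters; in practice these would be learned, here they are random.
left = rng.standard_normal((d_in_s, d_in_t)) / np.sqrt(d_in_t)
right = rng.standard_normal((d_out_t, d_out_s)) / np.sqrt(d_out_t)

student_w = generate_student_weight(teacher_w, left, right)
print(student_w.shape)
```

In a sketch like this, the generator's matrices would be trained so that the generated weights give the small network a good starting point, in contrast to knowledge distillation, which transfers knowledge through the large network's predictions rather than its parameters.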
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
