A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models
Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Sung Ju Hwang, Alexander Min

TL;DR
This paper investigates how to optimize Distillation from Weak Teacher (DWT) in NLP pre-training, examining teacher model quality, loss weighting, and parameter remapping to improve the training of larger language models.
Contribution
It introduces specific guidelines and techniques for improving DWT in NLP, focusing on teacher quality, loss weighting, and initialization strategies.
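As an illustration of the initialization side, the sketch below shows one simple way parameter remapping could work: copy a smaller teacher weight matrix into the top-left block of a larger student matrix and randomly initialize the rest. The block-copy layout, the function name `remap_parameters`, and the `init_scale` and `seed` parameters are assumptions made for this example; this summary does not describe the paper's actual remapping scheme.

```python
import random


def remap_parameters(teacher_weight, student_rows, student_cols,
                     init_scale=0.02, seed=0):
    """Hypothetical remapping: reuse teacher weights inside a larger matrix.

    teacher_weight is a list of rows (list of lists of floats). The student
    matrix is first filled with small Gaussian noise, then the teacher's
    values overwrite its top-left block. This layout is an assumption.
    """
    t_rows = len(teacher_weight)
    t_cols = len(teacher_weight[0])
    if t_rows > student_rows or t_cols > student_cols:
        raise ValueError("student matrix must be at least as large as the teacher's")

    rng = random.Random(seed)
    # Random initialization for every student parameter.
    student = [[rng.gauss(0.0, init_scale) for _ in range(student_cols)]
               for _ in range(student_rows)]
    # Overwrite the top-left block with the pre-trained teacher parameters.
    for i in range(t_rows):
        for j in range(t_cols):
            student[i][j] = teacher_weight[i][j]
    return student


teacher = [[1.0, 2.0], [3.0, 4.0]]
student = remap_parameters(teacher, student_rows=3, student_cols=4)
```

Here the 2x2 teacher block is preserved exactly, while the remaining student entries start from noise and are learned during pre-training.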
Findings
Teacher model quality significantly affects DWT effectiveness.
Proper loss weighting improves student model performance.
Parameter remapping enhances initialization for DWT.
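To make the loss-weighting finding concrete, here is a minimal sketch of a weighted distillation objective, assuming the common form that mixes a task loss with a KL term toward the teacher. The weight `alpha`, the `temperature` parameter, and the function name `dwt_loss` are illustrative assumptions; the exact objective and recommended weighting values used in the paper are not given in this summary.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Task loss: negative log-probability of the correct class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) between two probability vectors."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)


def dwt_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """Assumed objective: (1 - alpha) * task loss + alpha * distillation loss."""
    task = cross_entropy(student_logits, label)
    distill = kl_divergence(softmax(teacher_logits, temperature),
                            softmax(student_logits, temperature))
    return (1.0 - alpha) * task + alpha * distill


student_logits = [2.0, 0.5, -1.0]
teacher_logits = [1.0, 1.0, 0.0]  # a weaker, less confident teacher
loss = dwt_loss(student_logits, teacher_logits, label=0, alpha=0.3)
```

Setting `alpha` to 0 recovers plain task training, and raising it gives the weak teacher's predictions more influence over the student.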
Abstract
Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve its performance. Previous studies have shown that DWT can be effective in the vision domain and in the natural language processing (NLP) pre-training stage. Specifically, DWT shows promise in practical scenarios, such as enhancing new-generation or larger models using pre-trained yet older or smaller models, or training under a limited resource budget. However, the optimal conditions for using DWT have yet to be fully investigated in NLP pre-training. Therefore, this study examines three key factors to optimize DWT, distinct from those used in the vision domain or traditional knowledge distillation. These factors are: (i) the impact of teacher model quality on DWT effectiveness, (ii) guidelines for adjusting the weighting value for DWT loss, and (iii) the impact…
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)
