A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models

Hayeon Lee; Rui Hou; Jongpil Kim; Davis Liang; Sung Ju Hwang; Alexander Min

arXiv:2305.18239·cs.CL·May 30, 2023·1 cites

TL;DR

This paper investigates how to optimize Distillation from Weak Teacher (DWT) in NLP pre-training by examining teacher model quality, DWT loss weighting, and parameter remapping for student initialization, with the aim of improving the training of larger language models.

Contribution

It introduces specific guidelines and techniques for improving DWT in NLP, focusing on teacher quality, loss weighting, and initialization strategies.

Findings

01 Teacher model quality significantly affects DWT effectiveness.

02 Proper loss weighting improves student model performance.

03 Parameter remapping enhances initialization for DWT.

Abstract

Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve its performance. Previous studies have shown that DWT can be effective in the vision domain and natural language processing (NLP) pre-training stage. Specifically, DWT shows promise in practical scenarios, such as enhancing new generation or larger models using pre-trained yet older or smaller models and lacking a resource budget. However, the optimal conditions for using DWT have yet to be fully investigated in NLP pre-training. Therefore, this study examines three key factors to optimize DWT, distinct from those used in the vision domain or traditional knowledge distillation. These factors are: (i) the impact of teacher model quality on DWT effectiveness, (ii) guidelines for adjusting the weighting value for DWT loss, and (iii) the impact…
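The abstract refers to a weighting value for the DWT loss but, as excerpted here, does not give the formula. The following is a minimal sketch of a generic weighted distillation objective, not the authors' implementation: it assumes the student's loss is a convex combination of a hard-label cross-entropy term and a KL term toward the weaker teacher's softened predictions. The names `dwt_loss`, `alpha`, and `temperature` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label loss: negative log-probability of the true class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dwt_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """Assumed form: (1 - alpha) * hard-label loss + alpha * distillation loss.

    alpha is the (hypothetical) weighting value for the distillation term;
    the teacher here stands in for the smaller, weaker model.
    """
    hard = cross_entropy(student_logits, label)
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    soft = kl_divergence(teacher_probs, student_probs)
    return (1.0 - alpha) * hard + alpha * soft


if __name__ == "__main__":
    student = [1.0, 2.0, 3.0]   # logits from the larger student
    teacher = [0.5, 1.0, 1.5]   # logits from the smaller teacher
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, dwt_loss(student, teacher, label=2, alpha=alpha))
```

Under this assumed form, `alpha=0.0` ignores the teacher entirely and reduces to ordinary cross-entropy, while `alpha=1.0` trains the student only to match the teacher. With a weak teacher, the choice of `alpha` controls how much the student is pulled toward predictions that may be worse than its own.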

Code & Models

Repositories

huggingface/transformers (PyTorch, Official)

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)