Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks
Charith Peris, Lizhen Tan, Thomas Gueudre, Turan Gojayev, Pan Wei, Gokmen Oz

TL;DR
This paper investigates how the choice of transfer set affects knowledge distillation for NLU tasks, finding that distilling over target domain data yields better downstream performance than distilling over generic corpora, despite noisier teacher predictions.
Contribution
The paper provides empirical evidence that distilling over target domain data improves downstream NLU task performance compared with distilling over generic corpora, even though teacher predictions on target domain data are noisier.
Findings
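Using target domain data in the transfer set improves downstream task performance.
Distillation from a generic LM benefits downstream tasks, but is less effective than distilling over target domain data.
The impact of adding task-specific data to the transfer set correlates with the similarity between the generic and task-specific data.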
Abstract
Teacher-student knowledge distillation is a popular technique for compressing today's prevailing large language models into manageable sizes that fit low-latency downstream applications. Both the teacher and the choice of transfer set used for distillation are crucial ingredients in creating a high quality student. Yet, the generic corpora used to pretrain the teacher and the corpora associated with the downstream target domain are often significantly different, which raises a natural question: should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the downstream task corpora to align with finetuning? Our study investigates this trade-off using Domain Classification (DC) and Intent Classification/Named Entity Recognition (ICNER) as downstream tasks. We distill several multilingual students from a larger multilingual LM…
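The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes a common soft-target formulation of distillation (KL divergence between temperature-softened teacher and student distributions) and a simple sampling scheme for mixing generic and task-specific examples. The function names, the `task_fraction` parameter, and the temperature are hypothetical and are not taken from the paper, whose exact loss and sampling procedure may differ.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened distributions.

    Assumption: a standard soft-target loss, not necessarily the paper's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def build_transfer_set(generic, task_specific, task_fraction, size, seed=0):
    """Sample a transfer set in which `task_fraction` of the examples
    come from the task-specific corpus and the rest from the generic one.

    Hypothetical helper: sampling is with replacement for simplicity.
    """
    if not 0.0 <= task_fraction <= 1.0:
        raise ValueError("task_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_task = round(size * task_fraction)
    n_generic = size - n_task
    sample = rng.choices(task_specific, k=n_task) + rng.choices(generic, k=n_generic)
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    # Toy corpora standing in for generic and target domain text.
    generic = ["generic sentence 1", "generic sentence 2"]
    task = ["task utterance 1", "task utterance 2"]
    transfer_set = build_transfer_set(generic, task, task_fraction=0.5, size=6)
    print(transfer_set)
    # Toy logits: the loss is zero when the student matches the teacher.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Setting `task_fraction` to 0.0 corresponds to distilling over generic data only and 1.0 to distilling over task-specific data only; intermediate values give the mixed transfer sets the abstract alludes to.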
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Test · ALIGN · Knowledge Distillation
