Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks

Charith Peris; Lizhen Tan; Thomas Gueudre; Turan Gojayev; Pan Wei; Gokmen Oz

arXiv:2210.04834·cs.CL·October 19, 2022

Open Access

TL;DR

This paper investigates how the choice of transfer set affects knowledge distillation for NLU tasks, finding that distilling over target-domain data yields better downstream performance than distilling over generic corpora, even though teacher predictions on target-domain data are noisier.

Contribution

The paper provides empirical evidence that distilling over target-domain data improves downstream NLU task performance compared with generic corpora, challenging the assumption that high-quality teacher predictions on generic data matter most.

Findings

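01. Distilling over target-domain data improves downstream task performance.

02. Distilling over generic corpora still benefits downstream tasks, but is less effective than distilling over target-domain data.

03. The impact of adding target-domain data to the transfer set correlates with the similarity between the generic and target-domain data.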

Abstract

Teacher-student knowledge distillation is a popular technique for compressing today's prevailing large language models into manageable sizes that fit low-latency downstream applications. Both the teacher and the choice of transfer set used for distillation are crucial ingredients in creating a high quality student. Yet, the generic corpora used to pretrain the teacher and the corpora associated with the downstream target domain are often significantly different, which raises a natural question: should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the downstream task corpora to align with finetuning? Our study investigates this trade-off using Domain Classification (DC) and Intent Classification/Named Entity Recognition (ICNER) as downstream tasks. We distill several multilingual students from a larger multilingual LM…
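
The trade-off described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function or the sampling scheme the authors used, so everything here is an assumption: a standard temperature-softened KL divergence stands in for the distillation objective, and the helper names `distillation_loss` and `build_transfer_set`, along with the `task_fraction` parameter, are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a common choice of distillation objective, not
    necessarily the one used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def build_transfer_set(generic, task_specific, task_fraction, size, seed=0):
    """Sample a transfer set with a given share of task-specific examples.

    `task_fraction` (hypothetical parameter) is the proportion of the
    transfer set drawn from the task-specific corpus; the remainder is
    drawn from the generic corpus. Sampling is with replacement.
    """
    rng = random.Random(seed)
    n_task = round(size * task_fraction)
    n_generic = size - n_task
    sample = rng.choices(task_specific, k=n_task) + rng.choices(generic, k=n_generic)
    rng.shuffle(sample)
    return sample
```

Under these assumptions, sweeping `task_fraction` from 0 to 1 and training one student per setting would correspond to the comparison between generic-only, mixed, and task-specific-only transfer sets.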

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Test · ALIGN · Knowledge Distillation