Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

Jun Rao; Xuebo Liu; Zepeng Lin; Liang Ding; Jing Li; Dacheng Tao; Min Zhang

arXiv:2409.12512·cs.CL·September 23, 2024

Open Access

TL;DR

This paper introduces Online Knowledge Distillation (OKD), a novel method that dynamically adapts the teacher model during training to improve distillation efficiency and effectiveness in autoregressive language models.

Contribution

The paper proposes OKD, which integrates online modules into the teacher model, enabling adaptive distillation without extensive on-policy sampling and reducing training time.

Findings

01. OKD matches or exceeds the performance of existing methods.

02. OKD reduces training time by up to a factor of four.

03. OKD adapts effectively to a range of model architectures and sizes.

Abstract

Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. The success of KD in auto-regressive language models mainly relies on Reverse KL for mode-seeking and student-generated output (SGO) to combat exposure bias. Our theoretical analyses and experimental validation reveal that while Reverse KL effectively mimics certain features of the teacher distribution, it fails to capture most of its behaviors. Conversely, SGO incurs higher computational costs and presents challenges in optimization, particularly when the student model is significantly smaller than the teacher model. These constraints are primarily due to the immutable distribution of the teacher model, which fails to adjust adaptively to models of varying sizes. We introduce Online Knowledge Distillation (OKD), where the teacher network integrates small…
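
The contrast the abstract draws between Reverse KL and the usual forward KL can be illustrated with a toy example. The sketch below is not the paper's method or code: the four-outcome teacher distribution and the two candidate students (`student_cover`, `student_mode`) are hypothetical values chosen only to show the mode-seeking behavior of Reverse KL on a bimodal teacher.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions over the same support."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def forward_kl(teacher, student):
    # Forward KL: expectation taken under the teacher distribution.
    return kl(teacher, student)


def reverse_kl(teacher, student):
    # Reverse KL: expectation taken under the student distribution.
    return kl(student, teacher)


# Hypothetical bimodal teacher over four outcomes (illustrative values only).
teacher = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical students with limited capacity to match the teacher:
student_cover = [0.25, 0.25, 0.25, 0.25]  # spreads mass over everything
student_mode = [0.97, 0.01, 0.01, 0.01]   # commits to a single teacher mode

for name, student in [("cover", student_cover), ("mode", student_mode)]:
    print(
        f"{name:>5}: forward KL = {forward_kl(teacher, student):.3f}, "
        f"reverse KL = {reverse_kl(teacher, student):.3f}"
    )
```

With these values, forward KL assigns the lower loss to the student that spreads its mass, while Reverse KL assigns the lower loss to the student that locks onto one mode and ignores the other. This is one simple way to read the abstract's claim that Reverse KL mimics certain features of the teacher distribution while missing much of its behavior.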

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Knowledge Distillation