SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models

Jahyun Koo; Yerin Hwang; Yongil Kim; Taegwan Kang; Hyunkyung Bae; Kyomin Jung

arXiv:2410.19503 · cs.CL · April 23, 2025

TL;DR

SWITCH enhances knowledge distillation for large language models by selectively involving the teacher during student sequence generation, especially improving long sequence outputs and reducing noise from student-generated outputs.

Contribution

The paper introduces SWITCH, a novel method that strategically involves the teacher model during student training to improve long sequence generation in knowledge distillation.

Findings

01 SWITCH outperforms traditional KD methods across multiple datasets.

02 It significantly improves long sequence generation quality.

03 Experimental results show better alignment with teacher outputs.

Abstract

Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) as training data being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in…
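The selective intervention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the discrepancy measure (total variation distance), the fixed `threshold`, and the helper names `total_variation`, `pick_generator`, `generate`, `student_step`, and `teacher_step` are all assumptions made here for illustration. The abstract says only that SWITCH compares teacher and student token probabilities and lets the teacher intervene selectively.

```python
import random


def total_variation(p, q):
    """Total variation distance between two distributions over one vocabulary.

    Assumption: the listing does not say which discrepancy measure SWITCH
    uses; this one is chosen only because it is simple.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pick_generator(student_probs, teacher_probs, threshold):
    """Return 'teacher' when the two next-token distributions disagree by
    more than `threshold` (a hypothetical hyperparameter), else 'student'."""
    if total_variation(student_probs, teacher_probs) > threshold:
        return "teacher"
    return "student"


def generate(student_step, teacher_step, max_len, threshold, rng):
    """Build a training sequence token by token.

    `student_step` and `teacher_step` are hypothetical callables that map
    the token prefix to a next-token probability list.
    """
    tokens, sources = [], []
    for _ in range(max_len):
        s_probs = student_step(tokens)
        t_probs = teacher_step(tokens)
        source = pick_generator(s_probs, t_probs, threshold)
        probs = t_probs if source == "teacher" else s_probs
        # Sample the next token from whichever model was selected.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
        sources.append(source)
    return tokens, sources


if __name__ == "__main__":
    # Toy models over a two-token vocabulary that always disagree.
    student = lambda prefix: [0.9, 0.1]
    teacher = lambda prefix: [0.1, 0.9]
    print(generate(student, teacher, max_len=5, threshold=0.3, rng=random.Random(0)))
```

In this sketch the student generates on its own whenever the two distributions are close, so the resulting sequence stays mostly student-generated while the teacher steps in only at tokens where the models diverge.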

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Knowledge Distillation