Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
Cheng Feng, Chaoliang Zhong, Jun Sun, Yusuke Oishi

TL;DR
This paper introduces Scheduled Checkpoint Distillation (SCD), a distillation method that lets smaller student models match or even surpass their larger teachers on domain-specific tasks by balancing performance across teacher-favored and student-favored subdomains during training.
Contribution
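The paper proposes Scheduled Checkpoint Distillation (SCD), in which the student emulates the teacher's convergence process during supervised fine-tuning, and pairs it with a sample-wise Adaptive Weighting (AW) mechanism. Both are guided by a theoretical insight: a student can outperform its teacher when its advantage on the Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS).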
Findings
Outperforms existing distillation methods across multiple domain tasks.
Enables student models to match or surpass teacher performance.
Effective in multilingual and diverse NLP tasks.
Abstract
Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student…
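The SFS/TFS condition stated in the abstract can be illustrated with a small sketch. The formalization here is an assumption made for illustration, not the paper's definition: each evaluation sample gets a per-sample score for student and teacher, samples where the student scores higher are treated as the SFS, and samples where the teacher scores higher as the TFS. The function name `subdomain_gap` and the toy scores are hypothetical.

```python
from typing import Dict, Sequence


def subdomain_gap(student: Sequence[float], teacher: Sequence[float]) -> Dict[str, float]:
    """Split the student-vs-teacher score gap into an SFS advantage and a TFS deficit.

    Assumption (not from the paper): SFS = samples where the student scores
    higher, TFS = samples where the teacher scores higher. Ties contribute zero.
    """
    if len(student) != len(teacher) or len(student) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(student)
    # Average gain contributed by samples the student handles better.
    advantage = sum(s - t for s, t in zip(student, teacher) if s > t) / n
    # Average loss contributed by samples the teacher handles better.
    deficit = sum(t - s for s, t in zip(student, teacher) if t > s) / n
    return {
        "advantage": advantage,
        "deficit": deficit,
        # Equals mean(student) - mean(teacher) under this decomposition.
        "overall_gap": advantage - deficit,
        "student_wins": advantage > deficit,
    }


# Toy 0/1 correctness scores on four samples (hypothetical data).
result = subdomain_gap(student=[1, 1, 0, 1], teacher=[1, 0, 1, 0])
print(result)
```

Under this reading, the student's mean score exceeds the teacher's exactly when the advantage term is larger than the deficit term, which is why a method can aim to shrink the deficit without giving up the advantage.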
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Intelligent Tutoring Systems and Adaptive Learning
