Curriculum Learning-Guided Progressive Distillation in Large Language Models
Jincheng Cao, Fanzhi Zeng, Leqi Liu, Aryan Mokhtari

TL;DR
This paper introduces CLPD, a framework that improves knowledge distillation in large language models by jointly scheduling data difficulty and teacher capacity, leading to better student performance.
Contribution
It proposes a unified curriculum learning approach that explicitly aligns data difficulty with teacher capacity during distillation, enhancing reasoning abilities in small models.
Findings
CLPD outperforms standard distillation methods on reasoning benchmarks.
Joint data and teacher curriculum improves student model capabilities.
The framework is modular and integrates easily with existing distillation algorithms.
Abstract
Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing…
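The pairing described in the abstract, easy-to-hard data alongside teachers of increasing strength, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `build_joint_curriculum`, the use of a scalar `difficulty` score, and the equal-size split into one stage per teacher. The truncated abstract does not say how difficulty is measured or how stage boundaries are chosen.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Example = TypeVar("Example")
Teacher = TypeVar("Teacher")


def build_joint_curriculum(
    examples: Sequence[Example],
    difficulty: Callable[[Example], float],
    teachers: Sequence[Teacher],
) -> List[Tuple[Teacher, List[Example]]]:
    """Pair easy-to-hard data stages with weak-to-strong teachers.

    Hypothetical helper: `teachers` is assumed to be ordered from weakest
    to strongest, and `difficulty` is an assumed per-example score.
    """
    if not teachers:
        raise ValueError("at least one teacher is required")

    # Explicit curriculum: order the training examples from easy to hard.
    ordered = sorted(examples, key=difficulty)

    # Assumed staging rule: one equal-size stage per teacher (ceiling division).
    n_stages = len(teachers)
    stage_size = -(-len(ordered) // n_stages)

    schedule: List[Tuple[Teacher, List[Example]]] = []
    for k, teacher in enumerate(teachers):
        chunk = ordered[k * stage_size:(k + 1) * stage_size]
        if chunk:  # skip empty stages when there are few examples
            schedule.append((teacher, chunk))
    return schedule


# Toy usage: string length stands in for difficulty; teacher names are placeholders.
toy_examples = ["aaaa", "a", "aaa", "aa", "aaaaa", "aaaaaa"]
toy_teachers = ["small", "medium", "large"]
schedule = build_joint_curriculum(toy_examples, len, toy_teachers)
for teacher, chunk in schedule:
    print(teacher, chunk)
```

In a real pipeline each stage would feed a distillation step in which the student is trained on that stage's examples using supervision from the paired teacher; the sketch only builds the schedule and leaves the training loss unspecified.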
