Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

TL;DR
This paper introduces ATESD, a method that adaptively controls teacher exposure during self-distillation in large language models, leading to improved reasoning performance.
Contribution
The paper proposes a learnable exposure-control mechanism for self-distillation: how much of the reference reasoning the teacher sees is adjusted during training, based on future student improvement.
Findings
ATESD outperforms existing self-distillation methods on multiple benchmarks.
Adaptive exposure control improves reasoning accuracy over fixed exposure strategies.
The learned controller effectively balances teacher guidance and student learning progress.
Abstract
On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control…
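To make the notion of teacher exposure concrete, here is a minimal Python sketch of the two quantities the abstract discusses: how much of the reference reasoning the teacher sees, and how far the teacher's token targets sit from the student's. It is an illustration under stated assumptions, not the paper's implementation. The function names (`expose_reference`, `build_teacher_context`, `token_kl`, `mean_mismatch`), the choice to model exposure as a prefix fraction of reasoning steps, and the use of per-token KL divergence as the mismatch measure are all hypothetical choices made here.

```python
import math


def expose_reference(reference_steps, exposure):
    """Return the prefix of reference steps visible to the teacher.

    Assumption: exposure is a fraction in [0, 1] of the reasoning steps,
    revealed in order from the start. The paper may define it differently.
    """
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must be in [0, 1]")
    n = len(reference_steps)
    # Subtract a small tolerance so float noise (e.g. 0.7 * 10) does not
    # round up to one extra step.
    k = min(n, max(0, math.ceil(exposure * n - 1e-9)))
    return reference_steps[:k]


def build_teacher_context(problem, reference_steps, exposure):
    """Assemble a (hypothetical) teacher prompt with partial privileged reasoning."""
    visible = expose_reference(reference_steps, exposure)
    parts = [f"Problem: {problem}"]
    if visible:
        parts.append("Reference reasoning (partial):")
        parts.extend(visible)
    return "\n".join(parts)


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the vocabulary at one token position."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def mean_mismatch(teacher_dists, student_dists):
    """Average per-token KL along one student rollout."""
    kls = [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    steps = ["Step 1: expand.", "Step 2: simplify.", "Step 3: solve.", "Step 4: check."]
    # A fixed-exposure sweep would repeat training at each of these levels.
    for exposure in (0.0, 0.5, 1.0):
        print(exposure, len(expose_reference(steps, exposure)))
```

In a fixed-exposure sweep of the kind the abstract describes, one would hold `exposure` constant for a whole training run and compare runs. A learnable controller would instead choose `exposure` during training.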
