Distillation Traps and Guards: A Calibration Knob for LLM Distillability
Weixiao Zhan, Yongcheng Jing, Leszek Rutkowski, Dacheng Tao

TL;DR
This paper identifies three failure modes in knowledge distillation for LLMs (tail noise, off-policy instability, and the teacher-student gap), introduces a post-hoc calibration method that controls a teacher's distillability, and reports improved student performance and model protection across tasks.
Contribution
It presents a post-hoc calibration technique that uses reinforcement fine-tuning (RFT) to regulate a teacher's distillability, which the authors believe is the first method to do so, with the aim of improving both transfer quality and safety.
Findings
Students distilled from teachers calibrated to be distillable outperform baselines.
Teachers calibrated to be undistillable retain their own task performance but cause students distilled from them to collapse.
Calibration offers a practical safety and IP protection mechanism.
Abstract
Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis reveals several distillation traps that distort training signals: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher's distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, a KL anchor, and an across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student…
