TL;DR
This paper identifies a calibration issue in on-policy distillation of language models, proposes a new framework, CaOPD, that improves confidence calibration without sacrificing task performance, and demonstrates its effectiveness across models and domains.
Contribution
It introduces CaOPD, a calibration-aware distillation method that aligns confidence estimates with deployment conditions, addressing overconfidence in existing approaches.
Findings
CaOPD achieves Pareto-optimal calibration and maintains competitive capability.
CaOPD generalizes robustly to out-of-distribution settings and under continual learning.
Teacher supervision under privileged context causes systematic optimism bias.
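The optimism-bias finding can be illustrated with a toy calculation. This is not the paper's formal analysis: the per-task success probabilities below are invented, and the helper `optimism_gap` is hypothetical. The point it shows is that if a helpful privileged context (for example, a hint) raises success probability, then the teacher-conditioned success rate overstates the success rate the deployed model actually achieves from deployment-time input alone. A model trained to report the former as its confidence would therefore be overconfident on average.

```python
from statistics import mean
from typing import Sequence


def optimism_gap(p_priv: Sequence[float], p_deploy: Sequence[float]) -> float:
    """Average amount by which teacher-conditioned success overstates deployment success.

    p_priv[i]:   hypothetical success probability on task i with privileged context
    p_deploy[i]: hypothetical success probability on task i with deployment-time input only
    """
    if len(p_priv) != len(p_deploy) or not p_deploy:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(p_priv) - mean(p_deploy)


# Toy numbers (not from the paper): the privileged context adds +0.3 success
# probability on every task, capped at 1.0.
p_deploy = [0.2, 0.5, 0.7, 0.9]
p_priv = [min(1.0, p + 0.3) for p in p_deploy]

gap = optimism_gap(p_priv, p_deploy)
print(f"teacher-conditioned target: {mean(p_priv):.3f}")
print(f"deployment accuracy:        {mean(p_deploy):.3f}")
print(f"optimism gap:               {gap:.3f}")
```

Any positive lift from the privileged context yields a positive gap in this toy setup. The paper's claim that the bias is systematic is stated for its own formal setting, which this sketch does not reproduce.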
Abstract
On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with…
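The abstract's step of estimating empirical confidence from model rollouts can be sketched as follows, assuming that each prompt gets several sampled rollouts and that a verifier marks each rollout correct or incorrect. Under that assumption, the empirical confidence for a prompt is the fraction of correct rollouts. The sketch pairs this with a standard equal-width expected calibration error (ECE) to show how an overconfident self-report is penalized. The function names, the bin count, and the use of ECE are illustrative assumptions, not the paper's stated procedure, and the abstract's description of what replaces self-reported confidence is truncated here.

```python
from typing import Sequence


def empirical_confidence(rollout_correct: Sequence[bool]) -> float:
    """Fraction of sampled rollouts for one prompt that a (hypothetical) verifier marked correct."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard equal-width ECE: bin by confidence, weight |avg confidence - accuracy| by bin size."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes into the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy data (not from the paper): four prompts, each answered correctly half the time.
correct = [True, False, True, False]
self_reported = [0.95] * 4  # an overconfident self-report
rollouts = [True, False, True, False]  # hypothetical verifier results for one prompt's rollouts
empirical = [empirical_confidence(rollouts)] * 4

print("ECE, self-reported:", round(expected_calibration_error(self_reported, correct), 3))
print("ECE, empirical:    ", round(expected_calibration_error(empirical, correct), 3))
```

In practice, the rollouts used to estimate confidence should be separate from the answer whose correctness is scored; the toy data sidesteps this by construction.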
