Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang

TL;DR
This paper introduces EGRSD and CL-EGRSD, on-policy self-distillation methods that weight the teacher's token-level signal by the entropy of the teacher's predictive distribution, improving reasoning accuracy in large language models.
Contribution
The paper proposes entropy-guided self-distillation techniques that dynamically adjust token weights, enhancing reasoning performance over existing uniform-weight methods.
Findings
EGRSD and CL-EGRSD improve accuracy-length trade-offs in large language models.
A teacher-entropy confidence gate down-weights high-entropy token positions while keeping a nonzero lower bound on every token weight.
Experiments on Qwen models demonstrate state-of-the-art results among trainable methods.
Abstract
On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose…
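The three signals named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's formulation: the linear gate shape, the `floor` parameter, the use of the absolute log-likelihood ratio as the magnitude, the sign of the reward as the direction, and all function names are hypothetical. The sketch covers only the EGRSD-style gate; the causal-lookahead variant is not modeled.

```python
import math

def teacher_entropy(probs):
    """Shannon entropy (in nats) of one teacher next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gate(probs, floor=0.1):
    """Hypothetical confidence gate in [floor, 1].

    Low teacher entropy gives a weight near 1; maximal entropy gives `floor`,
    so no token weight collapses to zero. The linear shape is an assumption.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    normalized = teacher_entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return floor + (1.0 - floor) * (1.0 - normalized)

def token_weights(teacher_dists, teacher_logp, student_logp, reward, floor=0.1):
    """Combine direction, magnitude, and gate into one signed weight per token.

    teacher_dists: per-position teacher distributions over the vocabulary.
    teacher_logp / student_logp: log-probabilities of the sampled tokens.
    reward: scalar rollout reward; only its sign is used here (assumption).
    """
    direction = 1.0 if reward > 0 else -1.0
    weights = []
    for dist, t_lp, s_lp in zip(teacher_dists, teacher_logp, student_logp):
        magnitude = abs(t_lp - s_lp)  # |log teacher/student likelihood ratio|
        weights.append(direction * magnitude * entropy_gate(dist, floor))
    return weights

# Toy rollout: a confident teacher position followed by a maximally uncertain one.
dists = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
weights = token_weights(dists, [-0.1, -1.0], [-0.6, -1.5], reward=1.0)
```

In this toy rollout both positions have the same likelihood-ratio magnitude, so the difference between the two weights comes entirely from the gate: the uniform-distribution position is scaled down to the floor rather than removed.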
