Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke; Zichen Wen; Weijia Li; Conghui He; Linfeng Zhang

arXiv:2605.13255 · cs.AI · May 14, 2026

TL;DR

This paper introduces EGRSD and CL-EGRSD, on-policy self-distillation methods that weight token-level teacher signals by the entropy of the teacher's predictive distribution, improving reasoning accuracy in large language models.

Contribution

The paper proposes entropy-guided self-distillation objectives that down-weight token positions where the teacher is uncertain, improving reasoning performance over existing objectives that weight the teacher's signal uniformly.

Findings

01

EGRSD and CL-EGRSD improve accuracy-length trade-offs in large language models.

02

Down-weighting high-entropy token positions, rather than weighting every token uniformly, leads to better reasoning.

03

Experiments on Qwen models demonstrate state-of-the-art results among trainable methods.

Abstract

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose…
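The three signals named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the linear gate shape, the `floor` parameter, the use of the reward sign as the direction, and the absolute log-likelihood gap as the magnitude are all hypothetical and may differ from the paper's definitions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_gate(teacher_probs, floor=0.2):
    """Hypothetical gate: maps teacher entropy to a weight in [floor, 1].

    Higher teacher entropy gives a lower weight, but the weight never
    drops below `floor`, so every token keeps a nonzero gate value.
    The linear form is an assumption, not the paper's definition.
    """
    max_entropy = math.log(len(teacher_probs))
    normalized = entropy(teacher_probs) / max_entropy if max_entropy > 0 else 0.0
    return floor + (1.0 - floor) * (1.0 - normalized)


def token_weights(teacher_dists, teacher_logp, student_logp, reward, floor=0.2):
    """Combine the three signals into one signed weight per token.

    teacher_dists: teacher distribution over the vocabulary at each position
    teacher_logp / student_logp: log-likelihood of the sampled token under
        the teacher and the student at each position
    reward: scalar rollout reward; only its sign is used here (assumption)
    """
    direction = 1.0 if reward > 0 else -1.0
    weights = []
    for dist, t_lp, s_lp in zip(teacher_dists, teacher_logp, student_logp):
        magnitude = abs(t_lp - s_lp)  # assumed proxy for the likelihood-ratio magnitude
        weights.append(direction * magnitude * confidence_gate(dist, floor))
    return weights


if __name__ == "__main__":
    # Position 0: teacher is certain. Position 1: teacher is maximally uncertain.
    dists = [[1.0, 0.0], [0.5, 0.5]]
    print(token_weights(dists, [-0.1, -0.7], [-0.6, -0.2], reward=1.0))
```

In this toy setup both positions have the same teacher-student gap, so any difference between the two printed weights comes from the gate alone: the position where the teacher is uncertain receives the smaller weight, and that weight stays above zero because of `floor`.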
