OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

Yuxiao Yang; Xiaoyun Wang; Weitong Zhang

arXiv:2605.12400·cs.LG·May 13, 2026

TL;DR

OGLS-SD is an on-policy self-distillation method for LLM reasoning. It uses verifiable outcome rewards to steer the teacher's logits, correcting the bias that self-reflection and response templates introduce into teacher responses.

Contribution

The paper proposes a framework that uses verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate the teacher logits, which stabilizes self-distillation and improves reasoning performance.

Findings

01. Improved reasoning performance over standard OPSD.

02. Effective mitigation of response bias through outcome-guided logit steering.

03. Enhanced calibration of teacher responses, leading to better model training.

Abstract

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across…
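
The abstract does not state the steering formula, so the following is only a toy sketch of the general idea, not the authors' method. It assumes that, at a single token position, the steering direction is the difference between the mean logits observed on successful and on failed trajectories, and that this direction is added to the teacher logits before a KL distillation loss is computed. The function names `steer_teacher_logits` and `distill_kl`, the scale `alpha`, and the mean-difference rule are all hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def steer_teacher_logits(teacher_logits, success_logits, failure_logits, alpha=1.0):
    """Shift teacher logits toward tokens favored on successful trajectories.

    ASSUMPTION: the steering direction is mean(success) - mean(failure);
    the paper may define the contrast differently.
    teacher_logits: shape (V,); success_logits, failure_logits: shape (n, V).
    """
    steering = success_logits.mean(axis=0) - failure_logits.mean(axis=0)
    return teacher_logits + alpha * steering


def distill_kl(student_logits, teacher_logits):
    """KL(teacher || student), averaged over any leading positions."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean())


# Toy vocabulary of three tokens at one position.
teacher = np.array([2.0, 1.0, 0.0])                      # uncalibrated teacher
success = np.array([[0.0, 3.0, 0.0], [0.0, 1.0, 0.0]])   # logits on correct rollouts
failure = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # logits on incorrect rollouts

steered = steer_teacher_logits(teacher, success, failure, alpha=1.0)
student = np.zeros(3)
loss = distill_kl(student, steered)  # dense token-level distillation signal
```

In this toy case the steered teacher prefers the second token, which the successful rollouts favor, whereas the unsteered teacher prefers the first. Setting `alpha=0.0` recovers plain OPSD-style distillation against the original teacher logits.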
