Self-Supervised On-Policy Distillation for Reasoning Language Models

Zhiquan Tan; Yinrong Hong

arXiv:2605.17497·cs.LG·May 19, 2026

TL;DR

This paper introduces SSOPD, a self-supervised distillation method that contrasts correct and wrong on-policy attempts sampled for the same prompt to improve a language model's reasoning without external solution traces.

Contribution

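The paper proposes SSOPD, a self-supervised distillation technique that distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion, turning intra-group correct-wrong contrast into dense process supervision.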

Findings

01. SSOPD outperforms GRPO across multiple benchmarks.

02. On Qwen3-8B, SSOPD achieves a macro Avg@12 of 65.6.

03. SSOPD improves model performance without external solution traces.

Abstract

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level…
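
The shortest-correct / longest-wrong rule named in the abstract can be illustrated with a minimal Python sketch. It covers only the pair selection and the enumeration of wrong-completion prefixes, not the distillation loss, which the abstract does not specify. The names `Completion`, `select_pair`, and `wrong_prefixes` are hypothetical and are not taken from the paper or its repository; measuring length in tokens, breaking ties by group order, and skipping unmixed groups are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Completion:
    """One on-policy attempt for a prompt (hypothetical container)."""
    tokens: Tuple[int, ...]   # token ids of the completion
    correct: bool             # verifier outcome (terminal reward > 0)


def select_pair(group: Sequence[Completion]) -> Optional[Tuple[Completion, Completion]]:
    """Return (shortest correct, longest wrong), or None if the group is not mixed.

    Assumption: length is counted in tokens; ties go to the first
    completion in group order, which is how min/max resolve them.
    """
    correct = [c for c in group if c.correct]
    wrong = [c for c in group if not c.correct]
    if not correct or not wrong:
        return None  # all-correct or all-wrong group: no contrast to use
    witness = min(correct, key=lambda c: len(c.tokens))
    target = max(wrong, key=lambda c: len(c.tokens))
    return witness, target


def wrong_prefixes(target: Completion) -> List[Tuple[int, ...]]:
    """Enumerate the prefixes of the wrong completion, one per token position.

    Prefix t is the context preceding token t, so the empty prefix is
    included and the full completion is not.
    """
    return [target.tokens[:t] for t in range(len(target.tokens))]


if __name__ == "__main__":
    group = [
        Completion((1, 2, 3), True),
        Completion((1, 2), True),
        Completion((4, 5, 6, 7), False),
        Completion((4,), False),
    ]
    pair = select_pair(group)
    if pair is not None:
        witness, target = pair
        print(len(witness.tokens), len(target.tokens), len(wrong_prefixes(target)))
```

In a training loop along these lines, a teacher distribution conditioned on `witness` would be evaluated at each prefix returned by `wrong_prefixes(target)`, and the student would be trained toward it at those positions.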

Code & Models

Repositories

tzq1999/SSOPD (GitHub)
