TL;DR
This paper introduces SSOPD, a self-supervised distillation method that improves language models' reasoning by contrasting correct and wrong on-policy attempts within the same group, without external solution traces.
Contribution
The paper proposes SSOPD, a self-supervised on-policy distillation technique that turns the contrast between correct and wrong completions in a group into dense process supervision for reasoning models.
Findings
SSOPD outperforms GRPO across multiple benchmarks.
On Qwen3-8B, SSOPD achieves a macro Avg@12 of 65.6.
SSOPD improves model performance without external solution traces.
Abstract
GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level…
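The shortest-correct / longest-wrong selection described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' code: the `Completion` type, the `select_pair` and `wrong_prefixes` helpers, the use of token count as the length measure, and the choice to skip groups that are not mixed. The teacher distribution and the distillation loss are not shown, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Completion:
    """One sampled attempt for a prompt (hypothetical container)."""
    tokens: List[int]   # token ids of the completion
    correct: bool       # outcome of the verifiable terminal reward


def select_pair(group: List[Completion]) -> Optional[Tuple[Completion, Completion]]:
    """Return (shortest correct, longest wrong) for a mixed group, else None.

    Assumption: length is measured in tokens, and ties go to the first
    completion encountered.
    """
    correct = [c for c in group if c.correct]
    wrong = [c for c in group if not c.correct]
    if not correct or not wrong:
        return None  # all-correct or all-wrong groups carry no contrast
    witness = min(correct, key=lambda c: len(c.tokens))
    target = max(wrong, key=lambda c: len(c.tokens))
    return witness, target


def wrong_prefixes(target: Completion) -> List[List[int]]:
    """Prefixes of the wrong completion, one per next-token position."""
    return [target.tokens[:t] for t in range(len(target.tokens))]


group = [
    Completion(tokens=[1, 2, 3], correct=True),
    Completion(tokens=[1, 2], correct=True),
    Completion(tokens=[4, 5, 6, 7], correct=False),
    Completion(tokens=[4], correct=False),
]
pair = select_pair(group)
```

In this sketch, `pair[0]` is the witness that a teacher would be conditioned on, and `wrong_prefixes(pair[1])` lists the on-policy prefixes at which the student would receive supervision.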
