Reinforcement Learning via Self-Distillation

Jonas Hübotter; Frederike Lübeck; Lejs Behric; Anton Baumann; Marco Bagatella; Daniel Marta; Ido Hakimi; Idan Shenfeld; Thomas Kleine Buening; Carlos Guestrin; Andreas Krause

arXiv:2601.20802 · cs.LG · February 17, 2026

TL;DR

This paper introduces Self-Distillation Policy Optimization (SDPO), a reinforcement learning method that uses rich textual feedback to improve learning efficiency and accuracy in verifiable domains, surpassing traditional scalar reward methods.

Contribution

The paper presents SDPO, a novel reinforcement learning approach that leverages self-distillation from rich feedback without external teachers, enhancing sample efficiency and performance.

Findings

01. SDPO outperforms strong RLVR baselines in scientific reasoning, tool use, and programming tasks.

02. SDPO effectively uses implicit feedback from successful rollouts to improve learning.

03. SDPO accelerates discovery in binary-reward tasks with fewer attempts.

Abstract

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's…
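
The self-distillation idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the exact objective is not given in the text above, so the sketch assumes a per-token KL divergence from the feedback-conditioned "self-teacher" distribution to the policy distribution. The function names (`softmax`, `per_token_kl`) and the toy logits are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at every token position of one attempt.

    ASSUMPTION: the KL direction and the use of plain KL are illustrative
    choices; the paper's actual loss may differ.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # same model, conditioned on prompt + feedback
        q = softmax(s_row)  # policy, conditioned on prompt only
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses

# Toy attempt with two token positions and a three-token vocabulary.
# Position 0: feedback does not change the prediction.
# Position 1: feedback shifts probability mass to a different token.
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

token_losses = per_token_kl(teacher, student)
print(token_losses)
```

Under these assumptions, the loss is zero where the feedback leaves the next-token prediction unchanged and positive where it changes it, which is one way a single attempt can yield a dense, per-token signal instead of one scalar reward.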

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Algorithms