Self-Distilled RLVR

Chenxu Yang; Chuanyu Qin; Qingyi Si; Minghui Chen; Naibin Gu; Dingyu Yao; Zheng Lin; Weiping Wang; Jiaqi Wang; Nan Duan

arXiv:2604.03128 · cs.LG · April 9, 2026

TL;DR

This paper introduces RLSD, a training method that combines reinforcement learning with verifiable rewards (RLVR) and self-distillation to raise the convergence ceiling and improve training stability.

Contribution

It proposes RLSD, which leverages self-distillation for fine-grained updates while using RLVR for reliable environmental feedback, addressing issues of information leakage and instability.

Findings

01 RLSD achieves a higher convergence ceiling.

02 RLSD demonstrates improved training stability.

03 RLSD outperforms previous methods in experiments.

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with…
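The contrast the abstract draws between sparse and dense learning signals can be illustrated with a toy sketch. This is not the paper's RLSD objective, which the truncated abstract does not specify; the function names `verifiable_reward` and `per_token_distillation_signal` are hypothetical, the probabilities are made up, and the per-token log-ratio is only one common way to express a distillation-style signal.

```python
import math


def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """Sparse, RLVR-style signal: a single scalar for the whole trajectory.

    Hypothetical helper: exact string match stands in for a real verifier.
    """
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def per_token_distillation_signal(student_logprobs, teacher_logprobs):
    """Dense, distillation-style signal: one value per sampled token.

    Each entry is log p_teacher(token) - log p_student(token), evaluated on
    the tokens the student sampled. Positive values mean the teacher finds
    that token more likely than the student does. This is an illustrative
    assumption, not the formulation used in the paper.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Made-up per-token probabilities for a three-token trajectory.
student = [math.log(0.5), math.log(0.2), math.log(0.9)]
teacher = [math.log(0.6), math.log(0.8), math.log(0.9)]

sparse = verifiable_reward(" 42", "42")                  # one number per trajectory
dense = per_token_distillation_signal(student, teacher)  # one number per token

print("sparse reward:", sparse)
print("dense signal:", [round(x, 3) for x in dense])
```

In this toy setting the verifier yields a single value for the entire trajectory, while the teacher-student comparison yields a separate value at every token position, which is the sense in which the abstract calls one signal sparse and the other dense and fine-grained.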

Code & Models

Datasets

iieycx/rlsd-train-MMFineReason-123K
dataset · 155 downloads
