Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Jeonghye Kim; Jiwon Jeon; Dongsheng Li; Yuqing Yang

arXiv:2605.10781·cs.LG·May 12, 2026

TL;DR

This paper introduces RLRT (RLVR with Reversed Teacher), a reinforcement learning method that reads the self-distillation signal in reverse to encourage reasoning exploration in language models, and reports gains over self-distillation baselines.

Contribution

It proposes RLRT, which augments GRPO by reinforcing, on correct rollouts, the tokens the teacher would not have predicted, using the reversed self-distillation signal to improve reasoning exploration in RLVR settings.

Findings

01 RLRT outperforms self-distillation baselines across multiple checkpoints.

02 Reversing teacher signals enables more effective reasoning exploration.

03 Information asymmetry is identified as a key design axis for RLVR.

Abstract

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3…
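
As a rough illustration of the idea in the abstract, the sketch below combines a GRPO-style group-relative advantage with a per-token bonus on correct rollouts. It is not the authors' formulation: the abstract does not give the exact objective, so the bonus form (the positive part of the student-minus-teacher log-probability gap), the coefficient `beta`, and all function names here are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_bonus(student_logps, teacher_logps, beta=0.1):
    """Hypothetical bonus: positive only where the student assigned a token
    more log-probability than the teacher did (the teacher 'would not have
    predicted' it). The max(0, gap) form is an assumption, not from the paper."""
    return [beta * max(0.0, s - t) for s, t in zip(student_logps, teacher_logps)]


def token_advantages(rewards, student_logps, teacher_logps, beta=0.1):
    """Per-token advantages: the sequence-level GRPO advantage, plus the
    reversed-teacher bonus on correct rollouts only (reward > 0)."""
    advantages = grpo_advantages(rewards)
    per_token = []
    for r, adv, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            bonus = reversed_teacher_bonus(s_lp, t_lp, beta)
        else:
            # Failed rollouts receive no bonus in this sketch.
            bonus = [0.0] * len(s_lp)
        per_token.append([adv + b for b in bonus])
    return per_token


# Toy group: one correct and one incorrect rollout, two tokens each.
rewards = [1.0, 0.0]
student_logps = [[-0.1, -0.5], [-0.2, -0.3]]
teacher_logps = [[-2.1, -0.2], [-1.0, -1.0]]
result = token_advantages(rewards, student_logps, teacher_logps)
```

In this toy group, only the first token of the correct rollout receives extra credit, because it is the only token on a successful path where the student was more confident than the teacher. Tokens on the failed rollout keep the plain group-relative advantage even though the student also outscored the teacher there.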
