Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Jeonghye Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

TL;DR
This paper introduces RLRT, a reinforcement learning approach that reverses self-distillation signals to encourage reasoning exploration in language models, and reports gains over self-distillation baselines.
Contribution
It proposes a new method, RLRT, which leverages reversed self-distillation signals to improve reasoning exploration and model performance in RLVR settings.
Findings
RLRT outperforms self-distillation baselines across multiple checkpoints.
Reversing teacher signals enables more effective reasoning exploration.
Information asymmetry is identified as a key design axis for RLVR.
Abstract
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3…
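The mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual objective, so everything beyond the standard GRPO group-normalized advantage is an assumption made here for illustration: the function name `reversed_teacher_token_advantages`, the weight `beta`, and the choice to measure "a path the teacher would not have predicted" as the positive gap between student and teacher token log-probabilities are all hypothetical.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def reversed_teacher_token_advantages(rewards, student_logps, teacher_logps, beta=0.5):
    """Hypothetical token-level shaping, NOT the paper's confirmed formula.

    rewards:        one verifiable reward per rollout (assumed 1.0 correct, 0.0 wrong)
    student_logps:  per-token log-probs under the student (no extra information)
    teacher_logps:  per-token log-probs of the same tokens under the teacher
                    (same model, conditioned on extra information)
    beta:           assumed weight of the reversed-teacher bonus
    """
    seq_adv = grpo_advantages(rewards)
    shaped = []
    for r, adv, s_lps, t_lps in zip(rewards, seq_adv, student_logps, teacher_logps):
        token_adv = []
        for s_lp, t_lp in zip(s_lps, t_lps):
            bonus = 0.0
            if r > 0:
                # Correct rollout: reward tokens the teacher found less likely
                # than the student did (the distillation signal read in reverse).
                bonus = beta * max(0.0, s_lp - t_lp)
            token_adv.append(adv + bonus)
        shaped.append(token_adv)
    return shaped


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    student = [[-0.1, -0.2], [-0.3, -0.4]]
    teacher = [[-1.1, -0.1], [-2.0, -2.0]]
    for row in reversed_teacher_token_advantages(rewards, student, teacher):
        print([round(a, 3) for a in row])
```

In this sketch, failed rollouts keep the plain GRPO advantage, while on correct rollouts only the tokens where the student diverged from the teacher receive extra credit. A real implementation would likely operate on tensors and combine this with the usual clipped policy-gradient loss; those details are not stated in the abstract.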
