Reinforcement Learning Teachers of Test Time Scaling

Edoardo Cetin; Tianyu Zhao; Yujin Tang

arXiv:2506.08388 · cs.LG · October 30, 2025

TL;DR

This paper introduces Reinforcement-Learned Teachers (RLTs), a framework that trains reasoning language models to produce explanations optimized for downstream distillation. Because RLTs are given both the question and its solution, training avoids RL's exploration challenge, and RLTs outperform larger models on various tasks.

Contribution

The paper presents RLTs, teacher models trained with dense rewards, which serve as efficient teachers for distilling smaller models and improve performance and reusability on reasoning tasks.

Findings

1. RLTs outperform larger models in competition and graduate-level tasks.
2. RLTs maintain effectiveness when training larger students and on out-of-distribution tasks.
3. The approach reduces reliance on exploration in reinforcement learning for reasoning models.

Abstract

Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's…
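The dense-reward idea in the abstract (feed each explanation to the student, then test how well the student understands the solution) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the paper's implementation: the scoring rule (mean per-token log-probability of the solution), the `StudentProb` interface, and the `toy_student` stand-in, which simply favors tokens it has already seen in its context.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical student interface (an assumption): given the context tokens and a
# candidate next token, return the probability the student assigns to that token.
StudentProb = Callable[[Sequence[str], str], float]


def dense_reward(
    student_prob: StudentProb,
    question: Sequence[str],
    explanation: Sequence[str],
    solution: Sequence[str],
) -> float:
    """Mean log-probability the student gives to the solution tokens
    after reading the question and the teacher's explanation."""
    if not solution:
        raise ValueError("solution must contain at least one token")
    context: List[str] = list(question) + list(explanation)
    total = 0.0
    for token in solution:
        # Clamp to avoid log(0) if the student assigns zero probability.
        p = max(student_prob(context, token), 1e-12)
        total += math.log(p)
        context.append(token)  # teacher-forced: condition on the true token
    return total / len(solution)


def toy_student(context: Sequence[str], token: str) -> float:
    """Stand-in student (assumption): confident only about tokens it has seen."""
    return 0.9 if token in context else 0.1


question = ["2", "+", "2", "=", "?"]
solution = ["4"]
helpful = ["adding", "2", "and", "2", "gives", "4"]
unhelpful = ["think", "carefully"]

reward_helpful = dense_reward(toy_student, question, helpful, solution)
reward_unhelpful = dense_reward(toy_student, question, unhelpful, solution)
print(reward_helpful > reward_unhelpful)
```

Every explanation receives a graded score under this kind of rule, so the teacher gets a learning signal even when it could not have solved the problem from scratch. A real setup would replace `toy_student` with log-probabilities from an actual student LM.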

Videos

Reinforcement Learning Teachers of Test Time Scaling · SlidesLive