TL;DR
This paper introduces Reinforcement-Learned Teachers (RLTs), a framework that trains reasoning language models to be effective teachers for downstream distillation. It avoids RL's exploration challenge and outperforms larger models across various tasks.
Contribution
The paper presents RLTs, teachers trained with dense rewards that distill smaller student models efficiently, improving performance and reusability on reasoning tasks.
Findings
RLTs outperform larger models in competition and graduate-level tasks.
RLTs maintain effectiveness when training larger students and on out-of-distribution tasks.
The approach reduces reliance on exploration in reinforcement learning for reasoning models.
Abstract
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's…
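The dense-reward idea described above can be illustrated with a small sketch. The abstract is cut off before it specifies the reward, so everything here is an assumption: the reward is taken to be the student's average log-probability of the ground-truth solution tokens given the question and the teacher's explanation, and `toy_student` and `dense_reward` are hypothetical names, with a word-overlap heuristic standing in for a real student LM.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (not from the paper): a student scorer returns one
# log-probability per solution token, conditioned on question + explanation.
StudentScorer = Callable[[str, str, Sequence[str]], List[float]]


def toy_student(question: str, explanation: str,
                solution_tokens: Sequence[str]) -> List[float]:
    """Stand-in for a real LM: tokens that appear in the context are
    treated as likely (p=0.9), all others as unlikely (p=0.1)."""
    context = set((question + " " + explanation).lower().split())
    return [
        math.log(0.9) if tok.lower() in context else math.log(0.1)
        for tok in solution_tokens
    ]


def dense_reward(student: StudentScorer, question: str,
                 explanation: str, solution: str) -> float:
    """Assumed reward: mean student log-probability of the solution tokens.
    Every explanation gets a graded score, unlike a 0/1 correctness reward."""
    tokens = solution.split()
    if not tokens:
        raise ValueError("solution must contain at least one token")
    logps = student(question, explanation, tokens)
    return sum(logps) / len(tokens)


question = "What is 12 * 12 ?"
solution = "144"
good_explanation = "12 * 12 = 12 * 10 + 12 * 2 = 120 + 24 = 144"
bad_explanation = "Multiplication is repeated addition."

r_good = dense_reward(toy_student, question, good_explanation, solution)
r_bad = dense_reward(toy_student, question, bad_explanation, solution)
print(r_good > r_bad)
```

Because the teacher is given the solution, it always produces an explanation that can be scored, so the reward signal does not depend on the teacher solving the problem from scratch. In practice the toy scorer would be replaced by a real student model's token log-probabilities.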