TL;DR
This paper introduces BRTS, a framework for on-policy distillation that selects the best teacher trajectory among multiple samples to improve reasoning performance on challenging benchmarks.
Contribution
BRTS enhances on-policy distillation by selecting the most aligned teacher trajectory from a pool, reducing supervision noise and improving reasoning accuracy.
Findings
BRTS outperforms standard on-policy distillation (OPD) on reasoning benchmarks.
The largest gains are observed on the harder datasets.
BRTS incorporates a ground-truth recovery step for difficult prompts.
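The selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rollout` type, the `select_supervision` function, the `alignment_score` callable, and the toy scoring rule are all hypothetical names introduced here. The sketch also assumes that "most aligned" can be expressed as a scalar score per rollout, that rollouts are filtered by final-answer correctness when a reference is available, and that "ground-truth recovery" means falling back to a reference solution when no sampled rollout is correct. The summary does not confirm any of these details.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rollout:
    """Hypothetical container for one sampled reasoning trace."""
    text: str                # full reasoning trace
    final_answer: str        # answer extracted from the trace
    source: str = "teacher"  # "teacher" or "ground_truth"


def select_supervision(
    rollouts: Sequence[Rollout],
    alignment_score: Callable[[Rollout], float],
    ground_truth: Optional[Rollout] = None,
) -> Optional[Rollout]:
    """Pick one rollout out of N to use as the supervision target.

    Assumption: when a reference is given, only rollouts whose final
    answer matches it are eligible. If none qualifies, the reference
    itself is returned (which may be None).
    """
    if ground_truth is not None:
        pool = [r for r in rollouts if r.final_answer == ground_truth.final_answer]
    else:
        pool = list(rollouts)
    if not pool:
        return ground_truth  # assumed "recovery" fallback
    # Highest score wins; ties resolve to the earliest rollout.
    return max(pool, key=alignment_score)


# Toy usage: a stand-in score that simply prefers shorter traces.
samples = [
    Rollout("2+2=5", "5"),
    Rollout("2+2=4", "4"),
    Rollout("two plus two is four, so 4", "4"),
]
reference = Rollout("4", "4", source="ground_truth")
best = select_supervision(samples, lambda r: -len(r.text), reference)
fallback = select_supervision(samples[:1], lambda r: -len(r.text), reference)
```

In practice the score would more plausibly depend on the student model, for example the student's average log-probability of the teacher trace, but that choice is likewise an assumption rather than something stated in the summary.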
Abstract
On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the…
