On-Policy Distillation with Best-of-N Teacher Rollout Selection

Ke Zhang; Yunjie Tian; Dongdi Zhao; Yijiang Li; Yuanye Liu; Vishal M Patel; Di Fu

arXiv:2605.09725·cs.CV·May 14, 2026

TL;DR

This paper introduces BRTS, a framework for on-policy distillation that selects the best teacher trajectory among multiple samples to improve reasoning performance on challenging benchmarks.

Contribution

BRTS enhances on-policy distillation by selecting the most aligned teacher trajectory from a pool, reducing supervision noise and improving reasoning accuracy.
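The selection step can be illustrated with a minimal sketch. This page does not say how BRTS scores alignment, so everything below is an assumption made for illustration: the `Rollout` container, the `is_correct` verifier flag, and the use of the student's mean per-token log-probability as the alignment score are all hypothetical names and choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Rollout:
    """One sampled teacher trajectory (hypothetical container)."""
    text: str
    is_correct: bool        # assumed: outcome of some answer verifier
    student_logprob: float  # assumed: mean per-token log-prob under the student


def select_best_rollout(rollouts: Sequence[Rollout]) -> Optional[Rollout]:
    """Pick one teacher rollout from a pool of N candidates.

    Assumed rule: discard incorrect rollouts, then keep the one the
    student finds most likely (highest mean log-prob), as a stand-in
    for "most aligned". Returns None when no candidate is correct.
    """
    correct = [r for r in rollouts if r.is_correct]
    if not correct:
        return None
    return max(correct, key=lambda r: r.student_logprob)


# Usage: a pool of N = 3 teacher samples for a single prompt.
pool = [
    Rollout("A", is_correct=True, student_logprob=-1.9),
    Rollout("B", is_correct=True, student_logprob=-0.7),
    Rollout("C", is_correct=False, student_logprob=-0.2),
]
best = select_best_rollout(pool)
```

Filtering before ranking keeps an incorrect but student-likely rollout (such as `C` above) from being chosen. When the function returns `None`, the caller has to handle the prompt separately. The findings mention a ground-truth recovery step for difficult prompts, but this page does not describe how it works.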

Findings

01 BRTS outperforms standard OPD on reasoning benchmarks.

02 The largest gains are observed on harder datasets.

03 BRTS incorporates a ground-truth recovery step for difficult prompts.

Abstract

On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the…

Code & Models

Repositories

BWGZK-keke/BRTS (GitHub)
