Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

TL;DR
Prune-OPD is a dynamic on-policy distillation framework that detects and mitigates prefix drift to improve training efficiency and performance in long-horizon reasoning tasks.
Contribution
It introduces real-time drift detection and adaptive rollout truncation to optimize supervision quality and computational resources during on-policy distillation.
Findings
Reduces training time by up to 68% on various benchmarks.
Maintains or improves performance when prefix drift occurs.
Automatically adjusts the training window based on student-teacher compatibility.
Abstract
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This…
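The mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the multiplicative decay rule, and the values of `drift_threshold`, `decay`, and `min_weight` are all assumptions chosen for illustration, since the abstract only states that compatibility is monitored (for example via top-k overlap), that rewards are down-weighted monotonically after drift, and that rollouts are truncated.

```python
from typing import List, Sequence, Set, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest logits."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_overlap(student_logits: Sequence[float],
                  teacher_logits: Sequence[float],
                  k: int) -> float:
    """Fraction of the student's top-k token ids also in the teacher's top-k."""
    shared = top_k_ids(student_logits, k) & top_k_ids(teacher_logits, k)
    return len(shared) / k


def prune_weights(overlaps: Sequence[float],
                  drift_threshold: float = 0.5,   # assumed value
                  decay: float = 0.5,             # assumed value
                  min_weight: float = 0.1         # assumed value
                  ) -> Tuple[List[float], int]:
    """Turn per-position overlaps into reward weights and a truncation index.

    The weight starts at 1.0 and is multiplied by `decay` at every position
    whose overlap is below `drift_threshold`, so it never increases.
    The rollout is cut at the first position whose weight would fall
    below `min_weight`; positions from that index on get no weight.
    """
    weights: List[float] = []
    weight = 1.0
    for pos, overlap in enumerate(overlaps):
        if overlap < drift_threshold:
            weight *= decay          # hypothetical decay rule
        if weight < min_weight:
            return weights, pos      # truncate the rollout here
        weights.append(weight)
    return weights, len(overlaps)    # no truncation was needed


if __name__ == "__main__":
    overlaps = [1.0, 0.8, 0.4, 0.9, 0.2, 0.1, 0.0, 0.0]
    weights, cut = prune_weights(overlaps)
    print(weights, cut)
```

With these assumed settings, a rollout that stays compatible with the teacher keeps full weight at every position, while one that drifts early is cut after only a few tokens, which is where the compute savings would come from.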
