Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Zhicheng Yang; Zhijiang Guo; Yifan Song; Minrui Xu; Yongxin Wang; Yiwei Wang; Xiaodan Liang; Jing Tang

arXiv:2605.07804·cs.LG·May 11, 2026

TL;DR

Prune-OPD is a dynamic on-policy distillation framework that detects and mitigates prefix drift to improve training efficiency and performance in long-horizon reasoning tasks.

Contribution

Prune-OPD introduces real-time drift detection and adaptive rollout truncation, so that training compute is spent only where the teacher's supervision remains reliable during on-policy distillation.

Findings

01 Reduces training time by up to 68% on various benchmarks.

02 Maintains or improves performance when prefix drift occurs.

03 Automatically adjusts training window based on student-teacher compatibility.

Abstract

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This…
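
The mechanism described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the excerpt does not give the drift criterion, so the names `topk_overlap` and `prune_weights` and the parameters `threshold`, `patience`, `decay`, and `min_weight` are all hypothetical choices made for illustration. The sketch computes a top-k overlap between student and teacher logits at each position, halves the reward weight once overlap has been low for several consecutive tokens, never raises the weight again (so the weights are monotonically non-increasing), and truncates the rollout when the weight falls below a floor.

```python
from typing import List, Sequence, Tuple


def topk_overlap(student_logits: Sequence[float],
                 teacher_logits: Sequence[float],
                 k: int) -> float:
    """Fraction of token ids shared by the student's and teacher's top-k sets."""
    def topk(logits: Sequence[float]) -> set:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])
    return len(topk(student_logits) & topk(teacher_logits)) / k


def prune_weights(overlaps: Sequence[float],
                  threshold: float = 0.5,   # assumed: overlap below this counts as "low"
                  patience: int = 2,        # assumed: consecutive low steps before drift is declared
                  decay: float = 0.5,       # assumed: multiplicative down-weighting per drifted step
                  min_weight: float = 0.1   # assumed: truncate once the weight drops below this
                  ) -> Tuple[List[float], int]:
    """Return per-token reward weights and the index at which the rollout is cut.

    The weight only ever decreases, so later rewards on a drifted prefix
    never count more than earlier ones. If no truncation is triggered,
    the cut index equals len(overlaps).
    """
    weights: List[float] = []
    weight = 1.0
    low_streak = 0
    for t, overlap in enumerate(overlaps):
        if overlap < threshold:
            low_streak += 1
            if low_streak >= patience:
                weight *= decay          # severe drift: shrink the reward weight
        else:
            low_streak = 0               # streak resets, but the weight does not recover
        if weight < min_weight:
            return weights, t            # stop generating and scoring from here on
        weights.append(weight)
    return weights, len(overlaps)


# Example: overlap collapses from position 2 onward.
overlaps = [1.0, 0.8, 0.2, 0.2, 0.2, 0.2, 0.2, 0.9]
weights, cut = prune_weights(overlaps)
print(weights, cut)
```

In this toy trace the last two positions are never scored, which is where the compute saving would come from in a real rollout. A practical version would compute the overlap from batched model logits and would need the thresholds tuned per task.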
