Multi-Rollout On-Policy Distillation via Peer Successes and Failures
Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu

TL;DR
This paper introduces MOPD, a peer-conditioned on-policy distillation method that leverages multiple student rollouts, including successes and failures, to provide more informative training signals for large language models.
Contribution
MOPD is a peer-conditioned distillation framework that conditions the teacher on the student's other rollouts for the same prompt, so that on-policy distillation draws on both positive evidence from successful attempts and negative evidence from failed ones.
Findings
MOPD outperforms standard on-policy baselines across multiple benchmarks.
Mixed success-failure contexts improve alignment with verifier rewards.
Exploiting multi-rollout behavior leads to more faithful supervision.
Abstract
Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions:…
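The peer-conditioning idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Rollout` type, the `build_peer_context` helper, the prompt template, the 0/1 reward encoding, and the choice to exclude the target rollout from its own context are all hypothetical. The paper's two peer-context constructions are not reproduced, since the abstract is truncated before describing them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # student-generated trajectory
    reward: float  # sparse verifier reward; 1.0 = success, 0.0 = failure (assumed encoding)


def build_peer_context(prompt: str, group: List[Rollout], target_idx: int,
                       max_per_kind: int = 2) -> str:
    """Assemble a hypothetical teacher input for one target rollout.

    Peers are the other rollouts sampled for the same prompt. They are split
    by verifier reward into successes and failures, and both kinds are shown
    to the teacher with explicit labels.
    """
    # Assumption: the rollout being distilled is left out of its own context.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r.text for r in peers if r.reward > 0.5][:max_per_kind]
    failures = [r.text for r in peers if r.reward <= 0.5][:max_per_kind]

    parts = [f"Problem:\n{prompt}"]
    for k, text in enumerate(successes, 1):
        parts.append(f"Successful attempt {k}:\n{text}")
    for k, text in enumerate(failures, 1):
        parts.append(f"Failed attempt {k}:\n{text}")
    return "\n\n".join(parts)


# Toy rollout group for a single prompt.
group = [
    Rollout("2 + 2 = 4", 1.0),
    Rollout("2 + 2 = 5", 0.0),
    Rollout("2 + 2 = 22", 0.0),
]
context = build_peer_context("What is 2 + 2?", group, target_idx=2)
print(context)
```

In a full pipeline, a teacher model would presumably score the target rollout's tokens conditioned on a context like this one to produce the token-level distillation signal. That step is omitted here because the abstract does not specify it.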
