TL;DR
MAD-OPD introduces a multi-agent debate framework to enhance on-policy distillation, surpassing single-teacher limits and stabilizing training in agentic tasks, achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper proposes MAD-OPD, a multi-agent debate approach to on-policy distillation, and extends it to agentic tasks with a new divergence principle.
Findings
MAD-OPD outperforms all six baseline configurations.
It improves the agentic average by 2.4% and the code average by 3.7%.
The method ranks first across all tested configurations.
Abstract
On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize…
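The confidence-weighted supervision described in the abstract can be pictured with a small sketch. This is an illustration only, not the paper's implementation: it assumes each teacher exposes a next-token distribution and a scalar post-debate confidence, that the collective target is the confidence-normalized mixture of those distributions, and that the student is scored with a reverse KL divergence, a common choice in on-policy distillation. The excerpt above does not state which divergence MAD-OPD uses, and the names `mix_teachers` and `reverse_kl` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_teachers(teacher_probs, confidences):
    """Confidence-weighted average of per-teacher next-token distributions.

    Assumption: weights are the confidences normalized to sum to 1.
    """
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab = len(teacher_probs[0])
    return [
        sum(w * p[v] for w, p in zip(weights, teacher_probs))
        for v in range(vocab)
    ]

def reverse_kl(student_probs, target_probs, eps=1e-12):
    """KL(student || target) at one token position (assumed loss, see lead-in)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, target_probs)
    )

# Toy example: two teachers, a three-token vocabulary, one student position.
teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
confidences = [0.9, 0.3]              # hypothetical post-debate confidences
target = mix_teachers(teacher_probs, confidences)
student = softmax([0.0, 0.0, 0.0])    # uniform student distribution
loss = reverse_kl(student, target)
```

In this toy case the more confident teacher receives three times the weight of the other, so the mixed target leans toward its preferred token. A full training loop would average such a loss over every token of a student-generated trajectory.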
