Co-Evolving Policy Distillation
Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

TL;DR
This paper introduces Co-Evolving Policy Distillation (CoPD), a method that co-trains multiple experts bidirectionally to better consolidate diverse capabilities in a single model, outperforming existing paradigms.
Contribution
The paper proposes CoPD, a novel co-evolutionary training approach that integrates expert capabilities more effectively than traditional sequential or unidirectional methods.
Findings
CoPD outperforms mixed RLVR and MOPD baselines in integrating capabilities.
CoPD surpasses domain-specific experts in experiments.
CoPD enables a new training paradigm for scalable, multi-capability models.
Abstract
Reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of how these two paradigms consolidate multiple expert capabilities into a single model, and find that each loses capability in a different way: mixed RLVR suffers from an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD avoids divergence but fails to fully absorb teacher capabilities because of large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and introduces OPD during each expert's ongoing RLVR training rather than after expert training is complete, with the experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge…
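The training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes two experts, each a tabular softmax policy over three actions on two single-step tasks, uses exact gradients instead of sampled rollouts, and takes the reverse KL divergence KL(student || teacher) as the OPD objective. All names (`co_evolve`, `rlvr_grad`, `opd_grad`, `opd_weight`) are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def reverse_kl(student_logits, teacher_logits):
    p, q = softmax(student_logits), softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def rlvr_grad(logits, rewards):
    # Exact policy gradient of the expected reward for a softmax policy
    # (stand-in for sampled RLVR updates; an assumption of this sketch).
    p = softmax(logits)
    baseline = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - baseline) for pi, ri in zip(p, rewards)]

def opd_grad(student_logits, teacher_logits):
    # Gradient of KL(student || teacher) with respect to the student logits.
    p, q = softmax(student_logits), softmax(teacher_logits)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return [pi * (math.log(pi) - math.log(qi) - kl) for pi, qi in zip(p, q)]

def co_evolve(steps=500, lr=0.5, opd_weight=1.0):
    # rewards[task][action]: verifiable reward, 1.0 for the correct action.
    rewards = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    # experts[e][task] holds logits; expert e runs RLVR only on task e.
    experts = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
    for _ in range(steps):
        # Freeze both experts for this step so the update is symmetric.
        snap = [[list(logits) for logits in expert] for expert in experts]
        for e in range(2):
            peer = 1 - e
            # RLVR ascent on the expert's own domain.
            g = rlvr_grad(snap[e][e], rewards[e])
            experts[e][e] = [z + lr * gi for z, gi in zip(snap[e][e], g)]
            # OPD descent toward the peer on the peer's domain, applied
            # while the peer is still training (bidirectional distillation).
            d = opd_grad(snap[e][peer], snap[peer][peer])
            experts[e][peer] = [z - lr * opd_weight * di
                                for z, di in zip(snap[e][peer], d)]
    return experts, rewards

if __name__ == "__main__":
    experts, rewards = co_evolve()
    for e in range(2):
        for t in range(2):
            print(e, t, round(expected_reward(experts[e][t], rewards[t]), 3))
```

In this sketch each expert receives a reward signal only on its own task and learns the other task purely by distilling from its peer at every step, so the student never has to close a large gap to a fully trained teacher. Whether this toy behavior reflects the paper's results at scale is not something the sketch can show.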
