CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
Soo Min Kwon, Ziteng Sun, Ananda Theertha Suresh, Himanshu Jain, Sanjiv Kumar

TL;DR
CoDistill-GRPO is a co-distillation method that trains a large and a small language model simultaneously, improving the small model's performance on mathematical reasoning benchmarks while reducing training cost.
Contribution
This work presents a co-distillation algorithm that trains a small model alongside a large one, so the small model can learn from the large model without requiring a separately trained oracle. The result is better small-model performance and faster training.
Findings
Small models trained with CoDistill-GRPO outperform standard GRPO on math benchmarks.
Large models trained with CoDistill-GRPO nearly match standard GRPO performance while being updated on rollouts generated by the small model.
CoDistill-GRPO achieves approximately 18% training speedup, reducing computational costs.
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with…
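The sparse-reward problem described in the abstract can be illustrated with a minimal sketch. It uses the standard GRPO group-relative advantage (each rollout's reward standardized within its group) and shows that when every rollout for a hard prompt fails, all advantages are zero and the small model gets no learning signal. The dense reward below is an assumption made for illustration only: `kd_reward`, the weight `beta`, and the log-probability values are hypothetical and are not taken from the paper, whose exact KD reward is not given in this excerpt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kd_reward(teacher_logprobs):
    """Hypothetical dense reward: mean log-probability that the large model
    assigns to the tokens sampled by the small model (an assumption here)."""
    return sum(teacher_logprobs) / len(teacher_logprobs)


# Four rollouts from the small model on one hard prompt; none is correct,
# so the task reward is zero for every rollout.
task_rewards = [0.0, 0.0, 0.0, 0.0]
sparse_adv = group_relative_advantages(task_rewards)

# Made-up per-token log-probabilities from the large model for each rollout.
teacher_logprobs = [
    [-0.2, -0.1, -0.3],
    [-1.5, -2.0, -1.0],
    [-0.4, -0.6, -0.5],
    [-3.0, -2.5, -2.0],
]
beta = 0.1  # hypothetical weight on the dense KD term

shaped_rewards = [
    r + beta * kd_reward(lp) for r, lp in zip(task_rewards, teacher_logprobs)
]
dense_adv = group_relative_advantages(shaped_rewards)

print("sparse advantages:", sparse_adv)
print("dense advantages: ", dense_adv)
```

With the task reward alone, every advantage is zero. Once the dense term is added, rollouts that the large model finds more likely receive positive advantages and the rest receive negative ones, so the group still carries a usable signal.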
