TL;DR
This paper introduces GCPO, a cooperative policy optimization method that enhances reasoning accuracy and diversity in large language models by shifting from competition to team-based reward sharing.
Contribution
It proposes a novel cooperative training paradigm for LLM reasoning that improves diversity and accuracy over traditional winner-takes-all approaches.
Findings
GCPO significantly outperforms existing methods in reasoning accuracy.
GCPO increases solution diversity on multiple benchmarks.
Team-level credit assignment encourages non-redundant reasoning paths.
Abstract
Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms such as GRPO often suffer from exploration collapse, where models prematurely converge on a narrow set of high-scoring patterns and lose the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonuses. However, these approaches do not change the winner-takes-all nature of training, in which rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it…
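For context on the "independent rollout scoring" that the abstract contrasts GCPO with, the following is a minimal sketch of the group-relative advantage commonly used in GRPO-style training. It is not GCPO's team-level credit assignment, which the truncated abstract does not fully specify. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the mean and spread of its own group.

    Sketch of a GRPO-style baseline: every rollout is scored independently
    from its own verifier reward, so two identical correct answers receive
    the same advantage no matter how redundant they are.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier rewards for four rollouts of one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch both correct rollouts get an identical positive advantage even if they follow the same reasoning path, which is one way to read the competitive, per-rollout scoring the abstract describes.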
