Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

arXiv:2605.11461·cs.AI·May 19, 2026

TL;DR

This paper introduces GCPO, a cooperative policy optimization method that enhances reasoning accuracy and diversity in large language models by shifting from competition to team-based reward sharing.

Contribution

It proposes a novel cooperative training paradigm for LLM reasoning that improves diversity and accuracy over traditional winner-takes-all approaches.

Findings

01

GCPO significantly outperforms existing methods in reasoning accuracy.

02

GCPO increases solution diversity across multiple benchmarks.

03

Team-level credit assignment encourages non-redundant reasoning paths.

Abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms such as GRPO often suffer from exploration collapse: models prematurely converge on a narrow set of high-scoring patterns and lose the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonuses. However, these approaches do not change the winner-takes-all nature of training, in which rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it…
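The contrast the abstract draws between independent rollout scoring and team-level credit can be sketched in a few lines of Python. `grpo_advantages` follows the group-normalized advantage as commonly written for GRPO: each rollout is scored against its own group's mean and standard deviation, so every correct rollout receives the same positive advantage whether or not it repeats another. `leave_one_out_credit` is a hypothetical stand-in for team-level credit assignment and is not GCPO's formula, which the truncated abstract does not define. It assumes the team objective is the number of distinct solution strategies among correct rollouts and credits each rollout with its marginal contribution to that count. The strategy labels, `team_value`, and the leave-one-out scheme are all illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage as commonly written for GRPO: each rollout's
    reward is normalized by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def team_value(strategies, correct):
    """HYPOTHETICAL team objective (not from the paper): the number of
    distinct solution strategies among the correct rollouts."""
    return len({s for s, ok in zip(strategies, correct) if ok})


def leave_one_out_credit(strategies, correct):
    """HYPOTHETICAL team-level credit: each rollout is credited with how much
    the team value drops when that rollout is removed from the group."""
    full = team_value(strategies, correct)
    credits = []
    for i in range(len(strategies)):
        rest_s = strategies[:i] + strategies[i + 1:]
        rest_c = correct[:i] + correct[i + 1:]
        credits.append(full - team_value(rest_s, rest_c))
    return credits


if __name__ == "__main__":
    # Toy group of four rollouts: two redundant correct ones, one distinct
    # correct one, and one incorrect one (labels are illustrative).
    strategies = ["algebra", "algebra", "geometry", "guess"]
    correct = [True, True, True, False]
    rewards = [1.0 if ok else 0.0 for ok in correct]

    print(grpo_advantages(rewards))  # the three correct rollouts tie
    print(leave_one_out_credit(strategies, correct))  # → [0, 0, 1, 0]
```

On this toy group, GRPO gives all three correct rollouts identical advantages, while the leave-one-out sketch credits only the non-redundant "geometry" rollout. Pure leave-one-out also assigns zero to both "algebra" duplicates even though the team needs one of them; a Shapley-style average over orderings would split that credit instead.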

Code & Models

Repositories

bradybuddiemarch/gcpo (GitHub)