TL;DR
CPMobius introduces a collaborative Coach-Player framework for data-free reinforcement learning that enhances reasoning abilities in large language models without external data, outperforming existing unsupervised methods.
Contribution
It proposes a novel cooperative Coach-Player paradigm inspired by human sports collaboration, enabling data-free training of reasoning models with improved performance.
Findings
Achieves a +4.9 overall accuracy improvement on Qwen2.5-Math-7B-Instruct.
Outperforms existing unsupervised approaches such as RENT and R-Zero.
Demonstrates effective reasoning skill enhancement without external data.
Abstract
Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real-world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and…
