CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Ran Li; Zeyuan Liu; Yinghao Chen; Bingxiang He; Jiarui Yuan; Zixuan Fu; Weize Chen; Jinyi Hu; Zhiyuan Liu; Maosong Sun

arXiv:2602.02979 · cs.CL · May 19, 2026

TL;DR

CPMobius introduces a collaborative Coach-Player framework for data-free reinforcement learning that enhances reasoning abilities in large language models without external data, outperforming existing unsupervised methods.

Contribution

The paper proposes a cooperative Coach-Player paradigm, inspired by collaboration in human sports, that enables data-free training of reasoning models with improved performance.

Findings

01. Achieves a +4.9 improvement in overall accuracy on Qwen2.5-Math-7B-Instruct.

02. Outperforms existing unsupervised approaches such as RENT and R-zero.

03. Demonstrates effective enhancement of reasoning skills without external data.

Abstract

Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real-world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and…

Code & Models

Repositories

thunlp/CPMobius (GitHub)