Towards Understanding Self-play for LLM Reasoning
Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi

TL;DR
This paper investigates how self-play improves large language model reasoning by analyzing training dynamics and comparing it with other methods, revealing its mechanisms, limitations, and future potential.
Contribution
It provides a detailed analysis of self-play training dynamics for LLM reasoning, comparing it with RLVR and SFT, and explores factors influencing reasoning performance.
Findings
Self-play differs from RLVR and SFT in parameter update sparsity.
Entropy dynamics of token distributions are linked to reasoning performance.
Limitations of self-play highlight areas for future improvement.
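As a rough illustration of the two quantities named in the findings above, the sketch below computes an update-sparsity ratio and a token-level entropy. This page does not give the paper's exact definitions, so both are assumptions: sparsity is taken to be the fraction of parameters left unchanged by training (within a tolerance), and entropy is the Shannon entropy of a single next-token distribution. The function names `update_sparsity` and `token_entropy` are hypothetical.

```python
import math


def update_sparsity(before, after, tol=0.0):
    """Fraction of parameters whose value is unchanged (within tol).

    Assumed definition, not taken from the paper: a parameter counts as
    "not updated" if |after - before| <= tol.
    """
    if len(before) != len(after):
        raise ValueError("parameter vectors must have the same length")
    if not before:
        raise ValueError("parameter vectors must be non-empty")
    unchanged = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return unchanged / len(before)


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token probability distribution."""
    # Zero-probability tokens contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Toy flattened weights before and after a training run (made-up values).
    w_before = [1.0, 2.0, 3.0, 4.0]
    w_after = [1.0, 2.5, 3.0, 4.0]
    print(update_sparsity(w_before, w_after))

    # A uniform distribution over four tokens has the maximum entropy log(4).
    print(token_entropy([0.25, 0.25, 0.25, 0.25]))
```

A higher sparsity value under this definition means fewer weights moved during training; a lower entropy means the model concentrates probability on fewer tokens.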
Abstract
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future…
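The abstract mentions pass@k evaluations without describing how they are computed. A minimal sketch follows, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that are correct
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # Otherwise subtract the chance that all k chosen samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up example: 4 samples, 1 correct, budget of 2.
    print(pass_at_k(4, 1, 2))
```

Averaging this value over all problems in a benchmark gives the reported pass@k score under the stated assumption.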
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
