SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques

TL;DR
SPIRAL introduces a self-play framework for training large language models through multi-turn zero-sum games, enabling autonomous development of reasoning skills without human supervision, and achieving significant performance improvements across multiple benchmarks.
Contribution
The paper presents a scalable, online multi-agent reinforcement learning system with role-conditioned advantage estimation for training LLMs via self-play, eliminating the need for human-curated data.
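As a rough illustration of what self-play on a zero-sum game looks like, the sketch below has a single policy play both roles of a toy game and assigns returns that sum to zero. Everything in it is an assumption made for illustration: the game (a small Nim variant), the names `Turn`, `random_policy`, and `play_episode`, and the random policy standing in for a role-conditioned language model. It is not the paper's system.

```python
import random
from dataclasses import dataclass


@dataclass
class Turn:
    role: int    # 0 or 1: which player acted on this turn
    state: int   # stones left before the move
    action: int  # stones taken (1 to 3)


def random_policy(state: int, role: int, rng: random.Random) -> int:
    """Hypothetical stand-in for a role-conditioned LLM policy."""
    return rng.randint(1, min(3, state))


def play_episode(policy, stones: int = 10, seed: int = 0):
    """Play one game of Nim in which the same policy controls both roles.

    The player who takes the last stone wins. Returns are zero-sum:
    +1 for the winner and -1 for the loser.
    """
    rng = random.Random(seed)
    turns = []
    role = 0
    winner = 0
    while stones > 0:
        action = policy(stones, role, rng)
        turns.append(Turn(role, stones, action))
        stones -= action
        winner = role      # the most recent mover wins if the pile is now empty
        role = 1 - role    # alternate roles each turn
    returns = {winner: 1.0, 1 - winner: -1.0}
    return turns, returns


turns, returns = play_episode(random_policy, stones=10, seed=0)
print(len(turns), returns)
```

Because both roles share one policy, every episode yields training data for both sides, and the opponent improves whenever the policy does.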
Findings
Up to 10% performance improvement on reasoning benchmarks.
Transfer of reasoning skills across different models and tasks.
Multi-game training enhances reasoning capabilities.
Abstract
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10% across a suite of 8 reasoning benchmarks on 4…
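One way to picture role-conditioned advantage estimation is a separate baseline per role, so that a structural edge for one role (for example, moving first) is not mistaken for a better action. The sketch below is an assumption-laden illustration, not the paper's implementation: the class name `RoleConditionedBaseline`, the exponential moving average, the `decay` parameter, the (game, role) key, and the choice to update the baseline before subtracting it are all hypothetical.

```python
from collections import defaultdict


class RoleConditionedBaseline:
    """Keeps one running baseline per (game, role) pair (assumed design)."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        # Baselines start at 0.0, the expected return of an even zero-sum game.
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, ret: float) -> float:
        """Update the baseline for (game, role) and return ret minus baseline."""
        key = (game, role)
        baseline = self.decay * self.baselines[key] + (1 - self.decay) * ret
        self.baselines[key] = baseline
        return ret - baseline


rae = RoleConditionedBaseline(decay=0.5)
print(rae.advantage("nim", 0, 1.0))   # role 0 wins
print(rae.advantage("nim", 1, -1.0))  # role 1 loses the same game
print(rae.advantage("nim", 0, 1.0))   # role 0 wins again; its advantage shrinks
```

With a shared baseline, a role that wins most games would receive positive advantages almost regardless of how well it played; keeping baselines separate centers each role's returns on its own average.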
Peer Reviews
Decision: ICLR 2026 Poster
1. Solid idea and algorithm design: Using self-play to construct an automatic curriculum for improving LLMs’ reasoning ability is a reasonable idea, and the proposed multi-turn, multi-agent RL framework is a novel approach.
2. Good empirical results: The method is thoroughly evaluated across multiple models and benchmarks, showing consistent and meaningful performance improvements.
3. Good clarity and presentation: The manuscript is well written and easy to follow. Figures and tables effectively
1. The second contribution (RAE) is relatively weak: Normalizing advantages separately for different agents in multi-agent settings is a common practice in classical MARL [1, 2]. Applying it to LLM-based multi-agent systems is quite straightforward, which makes the second contribution less substantial.
2. Limited performance on instruct models: In Table 1, SPIRAL shows a significant improvement on base models but much smaller gains (1–2%) on instruct models. This raises the question of whether t
* The paper systematically shows transfer from simple games to harder reasoning tasks, strengthening prior evidence that self-play improves LLM reasoning.
* At face value, this paper provides strong gains across models, as shown in Table 1, with a relatively simple online policy optimization algorithm. These gains present an interesting avenue to future research, leveraging more robust and sample-efficient RL algorithms and game environment design.
* The automatic evaluation methodology could
* Major issue: the authors did not report seeded policy optimization, and results do not include mean±STD for critical experiments. This is a must to assess reproducibility and statistical robustness, while results may appear cherry-picked.
* Concerns about SFT dataset quality and size for a fair baseline:
  * Quality: The “expert” is unevaluated. Compare the generator to SPIRAL-trained models and to classic RL agents to establish competence. Report diversity of winning traces (e.g., % unique t
The paper is written in a clear and understandable manner, with a well-defined methodology and simple yet effective improvement strategies that are easy to follow.
● This work makes a major contribution toward the goal of self-improving LLMs by reducing the dependence on human-curated data. By using self-play as a source of unlimited training signal, SPIRAL facilitates significant model self-improvement. This approach could represent the next paradigm for RLVR, moving beyond the domains of math
While this is a great paper, there are several areas where further discussion or exploration could enhance its contribution. These points are primarily intended as constructive feedback instead of reasons to reject.
● The training resources are demanding, which could be a barrier to broader adoption. A discussion on potential avenues for improving computational efficiency (i.e. LoRA) would be a valuable addition.
● In line 412, it would be helpful if the authors could further elaborate on how
Code & Models
- 🤗 spiral-rl/Spiral-Qwen3-4B (model) · 11 dl · ♡ 4
- 🤗 spiral-rl/Spiral-DeepSeek-R1-Distill-Qwen-7B (model) · 7 dl · ♡ 2
- 🤗 caiyuchen/Spiral-step-0 (model)
- 🤗 caiyuchen/Spiral-step-1 (model) · 1 dl
- 🤗 caiyuchen/Spiral-step-2 (model)
- 🤗 caiyuchen/Spiral-step-4 (model) · 1 dl
- 🤗 caiyuchen/Spiral-step-3 (model)
- 🤗 caiyuchen/Spiral-step-6 (model) · 1 dl
- 🤗 caiyuchen/Spiral-step-5 (model)
- 🤗 caiyuchen/Spiral-step-8 (model)
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
Methods: Shrink and Fine-Tune
