SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu; Leon Guertler; Simon Yu; Zichen Liu; Penghui Qi; Daniel Balcells; Mickel Liu; Cheston Tan; Weiyan Shi; Min Lin; Wee Sun Lee; and Natasha Jaques

arXiv:2506.24119 · cs.AI · March 3, 2026

TL;DR

SPIRAL is a self-play framework that trains large language models on multi-turn, zero-sum games against continuously improving copies of themselves. The models develop reasoning skills without human supervision, and performance improves by up to 10% on a suite of reasoning benchmarks.

Contribution

The paper presents a scalable, online multi-agent reinforcement learning system with role-conditioned advantage estimation for training LLMs via self-play, eliminating the need for human-curated data.

Findings

01. Up to 10% performance improvement on reasoning benchmarks.

02. Transfer of reasoning skills across different models and tasks.

03. Multi-game training enhances reasoning capabilities.

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10% across a suite of 8 reasoning benchmarks on 4…
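
The abstract names role-conditioned advantage estimation (RAE) but this summary does not give its formula. As a rough illustration only, the sketch below assumes the simplest reading, in which each role's returns are compared against a baseline computed for that role alone, so that a systematic first-player or second-player edge does not leak into the learning signal. The function name, the role labels, and the use of a batch mean as the baseline are all assumptions made for this example, not the authors' implementation.

```python
from collections import defaultdict
from statistics import mean


def role_conditioned_advantages(episodes):
    """Subtract a per-role baseline from each episode return.

    `episodes` is a list of (role, episode_return) pairs, one per
    trajectory collected from self-play. The baseline here is the batch
    mean return for that role (an assumption; the paper's exact
    estimator is not stated in this summary).
    """
    returns_by_role = defaultdict(list)
    for role, episode_return in episodes:
        returns_by_role[role].append(episode_return)

    # One baseline per role, never shared across roles.
    baselines = {role: mean(rs) for role, rs in returns_by_role.items()}

    return [episode_return - baselines[role] for role, episode_return in episodes]


# Hypothetical zero-sum batch in which player_0 wins two games out of three.
batch = [
    ("player_0", 1.0), ("player_0", 1.0), ("player_0", -1.0),
    ("player_1", -1.0), ("player_1", -1.0), ("player_1", 1.0),
]
advantages = role_conditioned_advantages(batch)
```

In this toy batch the favoured role receives a smaller positive advantage for a win than the disfavoured role does, because each role is measured against its own expected outcome rather than against a shared baseline.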

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Solid idea and algorithm design: Using self-play to construct an automatic curriculum for improving LLMs’ reasoning ability is a reasonable idea, and the proposed multi-turn, multi-agent RL framework is a novel approach.
2. Good empirical results: The method is thoroughly evaluated across multiple models and benchmarks, showing consistent and meaningful performance improvements.
3. Good clarity and presentation: The manuscript is well written and easy to follow. Figures and tables effectively

Weaknesses

1. The second contribution (RAE) is relatively weak: Normalizing advantages separately for different agents in multi-agent settings is a common practice in classical MARL [1, 2]. Applying it to LLM-based multi-agent systems is quite straightforward, which makes the second contribution less substantial.
2. Limited performance on instruct models: In Table 1, SPIRAL shows a significant improvement on base models but much smaller gains (1–2%) on instruct models. This raises the question of whether t

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The paper systematically shows transfer from simple games to harder reasoning tasks, strengthening prior evidence that self-play improves LLM reasoning.
* At face value, this paper provides strong gains across models, as shown in Table 1, with a relatively simple online policy optimization algorithm. These gains present an interesting avenue to future research, leveraging more robust and sample-efficient RL algorithms and game environment design.
* The automatic evaluation methodology could

Weaknesses

* Major issue: the authors did not report seeded policy optimization, and results do not include mean±STD for critical experiments. This is a must to assess reproducibility and statistical robustness, while results may appear cherry-picked.
* Concerns about SFT dataset quality and size for a fair baseline:
  * Quality: The “expert” is unevaluated. Compare the generator to SPIRAL-trained models and to classic RL agents to establish competence. Report diversity of winning traces (e.g., % unique t
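
The mean±STD reporting that this reviewer asks for can be sketched in a few lines. The example below is a generic illustration, not the authors' evaluation code: the helper name is hypothetical, and the scores are placeholder numbers invented for the example, not results from the paper.

```python
from statistics import mean, stdev


def format_mean_std(scores, digits=1):
    """Summarize one benchmark score per random seed as 'mean ± std'.

    Uses the sample standard deviation (n - 1 in the denominator), so at
    least two seeds are required.
    """
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return f"{mean(scores):.{digits}f} ± {stdev(scores):.{digits}f}"


# Placeholder accuracies from three hypothetical seeds (not real results).
placeholder_scores = [40.0, 42.0, 44.0]
summary = format_mean_std(placeholder_scores)
```

Reporting the spread alongside the mean lets a reader judge whether a gain of a few points exceeds seed-to-seed variation.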

Reviewer 03 · Rating 8 · Confidence 4

Strengths

● The paper is written in a clear and understandable manner, with a well-defined methodology and simple yet effective improvement strategies that are easy to follow.
● This work makes a major contribution toward the goal of self-improving LLMs by reducing the dependence on human-curated data. By using self-play as a source of unlimited training signal, SPIRAL facilitates significant model self-improvement. This approach could represent the next paradigm for RLVR, moving beyond the domains of math

Weaknesses

While this is a great paper, there are several areas where further discussion or exploration could enhance its contribution. These points are primarily intended as constructive feedback rather than reasons to reject.
● The training resources are demanding, which could be a barrier to broader adoption. A discussion of potential avenues for improving computational efficiency (e.g., LoRA) would be a valuable addition.
● In line 412, it would be helpful if the authors could further elaborate on how

Code & Models

Repositories

spiral-rl/spiral
Official

Models

Datasets

spiral-rl/Spiral-Kuhn-Poker-Qwen3-32B-SFT
dataset · 22 dl

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications

Methods: Shrink and Fine-Tune