R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

Gengsheng Li; Jinghan He; Shijie Wang; Dan Zhang; Ruiqi Liu; Renrui Zhang; Zijun Yao; Junfeng Fang; Haiyun Guo; Jinqiao Wang

arXiv:2602.13103·cs.LG·February 17, 2026

Open Access

TL;DR

R-Diverse is a self-play training method for large language models that targets Diversity Illusion, a failure mode in which the Solver's training signals look diverse but collapse into recurring patterns. By mitigating it, R-Diverse sustains reasoning improvements over more iterations and outperforms prior self-play methods.

Contribution

The paper proposes R-Diverse, which combines Memory-Augmented Penalty (MAP) and Skill-Aware Measurement to mitigate both Local and Surface Diversity Illusion in self-play training.

Findings

01. R-Diverse sustains reasoning gains over more self-play iterations.

02. It outperforms prior self-play methods across 10 reasoning benchmarks.

03. It mitigates both Local and Surface Diversity Illusion.

Abstract

Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a…
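The Local Diversity Illusion described above can be illustrated with a toy sketch. This is not the paper's MAP implementation (the abstract is truncated before it is specified); it only shows, under assumed data, why a diversity check confined to one batch can miss repetition across iterations. The skill labels, the helper names `within_batch_duplicate_rate` and `seen_before_rate`, and the exact-match notion of "repetition" are all hypothetical.

```python
from typing import List, Set


def within_batch_duplicate_rate(batch: List[str]) -> float:
    """Fraction of items that repeat another item in the same batch."""
    return 1.0 - len(set(batch)) / len(batch)


def seen_before_rate(batch: List[str], memory: Set[str]) -> float:
    """Fraction of items already present in earlier iterations."""
    return sum(q in memory for q in batch) / len(batch)


# Hypothetical skill labels standing in for Challenger-generated questions.
iterations = [
    ["algebra", "geometry", "probability"],
    ["number_theory", "combinatorics", "calculus"],
    ["algebra", "geometry", "probability"],  # cycles back to iteration 0
]

memory: Set[str] = set()          # everything generated in earlier iterations
local_rates: List[float] = []     # what a within-batch check observes
cross_rates: List[float] = []     # what a cross-iteration check observes

for batch in iterations:
    local_rates.append(within_batch_duplicate_rate(batch))
    cross_rates.append(seen_before_rate(batch, memory))
    memory.update(batch)

print(local_rates)  # every batch looks fully diverse on its own
print(cross_rates)  # the last batch is entirely a repeat of iteration 0
```

In this toy setting every batch passes the within-batch check, while the cross-iteration check flags the third batch as fully repeated, which is the mode cycling the abstract attributes to within-batch-only diversity.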

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Constraint Satisfaction and Optimization