R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training
Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang

TL;DR
R-Diverse is a self-play training method for large language models that mitigates the Diversity Illusion, sustaining reasoning improvements over more iterations and outperforming prior self-play methods.
Contribution
The paper proposes R-Diverse, which combines Memory-Augmented Penalty (MAP) and Skill-Aware Measurement to mitigate the Diversity Illusion in self-play training.
Findings
R-Diverse sustains reasoning gains over more self-play iterations.
It outperforms prior self-play methods across 10 reasoning benchmarks.
The approach effectively mitigates local and surface diversity illusions.
Abstract
Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a…
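The Local Diversity Illusion described in the abstract can be illustrated with a toy sketch. This is not the paper's MAP, whose details are cut off in the abstract above. It only contrasts a diversity check confined to the current batch with one that also consults a memory of earlier iterations. The function names, the token-overlap (Jaccard) similarity, and the example questions are all assumptions made for illustration.

```python
# Toy illustration only: names, similarity measure, and data are hypothetical,
# not taken from the R-Diverse paper.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two questions (a crude stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def within_batch_penalty(batch: list[str]) -> list[float]:
    """For each question, the max similarity to the other questions in the same batch."""
    return [
        max((jaccard(q, o) for j, o in enumerate(batch) if j != i), default=0.0)
        for i, q in enumerate(batch)
    ]


def memory_penalty(batch: list[str], memory: list[str]) -> list[float]:
    """For each question, the max similarity to any question kept from earlier iterations."""
    return [max((jaccard(q, m) for m in memory), default=0.0) for q in batch]


# Two iterations whose batches look diverse internally but repeat each other.
iteration_1 = ["solve the quadratic equation for x", "count paths on a grid graph"]
iteration_2 = ["count paths on a grid graph", "solve the quadratic equation for x"]

memory = list(iteration_1)  # questions remembered from the previous iteration

local = within_batch_penalty(iteration_2)   # within-batch check sees no repetition
cross = memory_penalty(iteration_2, memory)  # memory-based check flags every question

print("within-batch:", local)
print("cross-iteration:", cross)
```

In this sketch the second batch passes the within-batch check while repeating the first batch exactly, which is the cross-iteration cycling the abstract describes. Note that token overlap measures surface form only, so a check like this would still be exposed to the Surface Diversity Illusion, where questions differ in wording but require near-identical reasoning skills.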
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Constraint Satisfaction and Optimization
