TL;DR
R-Zero is a fully autonomous, self-evolving framework that starts from a single base LLM and generates its own training data from scratch, removing the reliance on human-curated tasks and labels while improving the model's reasoning capabilities.
Contribution
The paper presents R-Zero, an autonomous training method in which two models initialized from the same base LLM, a Challenger and a Solver, co-evolve: the Challenger proposes tasks and the Solver learns to solve them, with no human-curated datasets involved.
Findings
Significant reasoning performance improvements across multiple LLMs.
Boosts of +6.49 on math-reasoning benchmarks.
Boosts of +7.54 on general reasoning benchmarks.
Abstract
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving…
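The Challenger's incentive described above can be illustrated with a small sketch. The abstract as quoted here does not give a formula, so the scoring rule below is an assumption chosen for illustration: it treats a task as "near the edge" of the Solver's capability when the Solver's sampled answers agree with each other about half the time. The function name `challenger_reward` and its input format are hypothetical.

```python
from collections import Counter


def challenger_reward(solver_answers):
    """Score one proposed task from the Solver's sampled answers.

    Assumption (not from the source): the task is most valuable when
    the Solver's self-agreement is close to 50%.
    """
    if not solver_answers:
        return 0.0
    # Fraction of sampled answers that match the most common answer.
    top_count = Counter(solver_answers).most_common(1)[0][1]
    agreement = top_count / len(solver_answers)
    # Equals 1.0 at agreement 0.5 and shrinks as agreement moves toward 0 or 1.
    return 1.0 - 2.0 * abs(agreement - 0.5)


# A task the Solver always answers the same way earns nothing,
# while a task that splits the Solver's answers earns the full reward.
easy = challenger_reward(["4"] * 10)
borderline = challenger_reward(["4"] * 5 + ["5"] * 3 + ["6"] * 2)
```

Under this assumed rule, trivially easy tasks (unanimous answers) and tasks with widely scattered answers both score low, which pushes the Challenger toward questions the Solver can only sometimes answer consistently.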
Peer Reviews
Decision: ICLR 2026 Poster
The paper reports consistent performance improvements across both mathematical and general reasoning benchmarks. It conducts rich and detailed ablation studies, revealing several interesting phenomena.
1. Models fine-tuned after R-Zero pretraining perform better than those fine-tuned directly.
2. Both task filtering and repetition penalty are shown to be essential components.
3. The paper identifies a model collapse phenomenon after multiple self-evolution iterations, with larger models show
Some of the reported improvements in Table 1 appear to be statistically insignificant, weakening the empirical strength of the main claims.
1. Generating novel data from scratch is a valuable research direction, where the co-evolving framework that includes a dual-agent setup is novel and insightful.
2. The experiments conducted in this work are extensive, and the empirical performance improvement appears to be large.
3. The ablation study is thorough and provides fruitful findings for future research in this direction.
1. The baseline studied in this work is relatively weak. There are other data generation approaches, such as Absolute Zero [1], which have been discussed but not directly compared empirically.
2. Meanwhile, as the authors also mention, RLVR methods with zero-shot training objectives, such as maximizing model confidence or entropy, also need to be compared against the data generation approaches, given that they are all zero-shot approaches.
[1] Zhao, A., Wu, Y., Yue, Y., Wu,
R-Zero's primary strength is its ability to create a self-improving loop for reasoning tasks without human-labeled alignment data. It successfully adapts the self-play paradigm to a domain that lacks a perfect external verifier (like a code executor or game engine), cleverly using a majority-vote mechanism to create a noisy but effective "pseudo-ground truth". The framework demonstrates significant and consistent performance gains across different model architectures (Qwen3, OctoThinker) and sc
The most significant limitation is that the self-improvement process is not indefinitely stable. After a few iterations, all tested models experience a "performance collapse," where their scores on benchmarks begin to decline. Larger models are more resilient and collapse later, but the eventual degradation appears inherent to the current framework. The performance collapse is directly linked to a decline in the quality of the training data. As the Challenger generates progressively harder prob
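The majority-vote "pseudo-ground truth" and the task filtering mentioned in the reviews above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the input format, and the agreement thresholds `low` and `high` are all assumptions made for the example.

```python
from collections import Counter


def pseudo_label(answers):
    """Return (majority_answer, agreement) for one question's sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def build_training_set(samples_by_question, low=0.3, high=0.8):
    """Label each question by majority vote and keep only mid-agreement ones.

    `low` and `high` are hypothetical thresholds: questions with very high
    agreement are treated as too easy, and questions with very low agreement
    are treated as too noisy to trust the majority answer.
    """
    kept = []
    for question, answers in samples_by_question.items():
        label, agreement = pseudo_label(answers)
        if low <= agreement <= high:
            kept.append((question, label))
    return kept


samples = {
    "q1": ["4"] * 10,                   # unanimous: dropped as too easy
    "q2": ["7"] * 6 + ["9"] * 4,        # 60% agreement: kept, labeled "7"
    "q3": [str(i) for i in range(10)],  # no consensus: dropped as too noisy
}
training_set = build_training_set(samples)
```

Note that agreement only measures consistency, not correctness. A majority answer can be wrong, which is consistent with the reviewers' point that the quality of the self-generated training data can decline over iterations.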
