R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang; Wenhao Yu; Xiaoyang Wang; Hongming Zhang; Zongxia Li; Ruosen Li; Jiaxin Huang; Haitao Mi; Dong Yu

arXiv:2508.05004·cs.LG·February 16, 2026

TL;DR

R-Zero is a fully autonomous, self-evolving framework in which a single base LLM generates its own training data from scratch. A Challenger proposes tasks and a Solver learns to solve them, removing the reliance on human-curated tasks and labels while improving reasoning capability.

Contribution

The paper presents R-Zero, an autonomous training method in which two models initialized from the same base LLM generate and solve their own tasks, improving without human-curated datasets.

Findings

01

Consistent reasoning gains across different model architectures (Qwen3, OctoThinker).

02

Gain of +6.49 points on math-reasoning benchmarks (reported for Qwen3-4B-Base).

03

Gain of +7.54 points on general-domain reasoning benchmarks (reported for Qwen3-4B-Base).

Abstract

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving…
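The abstract says the Challenger is rewarded for proposing tasks near the edge of the Solver's capability. One way to make that idea concrete, offered as a minimal sketch and not necessarily the paper's exact formula, is a reward that peaks when the Solver succeeds about half the time. The name `challenger_reward` and the peak at a 0.5 success rate are illustrative assumptions.

```python
def challenger_reward(solver_success_rate: float) -> float:
    """Illustrative 'edge of capability' reward (assumed form, not from the source).

    Returns 1.0 when the Solver succeeds half the time and falls linearly
    to 0.0 when the task is always solved or never solved.
    """
    if not 0.0 <= solver_success_rate <= 1.0:
        raise ValueError("solver_success_rate must be in [0, 1]")
    return 1.0 - 2.0 * abs(solver_success_rate - 0.5)


# Tasks that are trivially easy or impossible earn nothing under this sketch.
for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, challenger_reward(rate))
```

Under this assumed shape, a Challenger gains nothing from questions the Solver always or never answers correctly, which pushes it toward tasks of intermediate difficulty.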

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

The paper reports consistent performance improvements across both mathematical and general reasoning benchmarks. It conducts rich and detailed ablation studies, revealing several interesting phenomena:
1. Models fine-tuned after R-Zero pretraining perform better than those fine-tuned directly.
2. Both task filtering and repetition penalty are shown to be essential components.
3. The paper identifies a model collapse phenomenon after multiple self-evolution iterations, with larger models show
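The review names task filtering as an essential component but does not describe how it works. A hedged sketch of one plausible form follows: keep only generated questions on which the Solver's sampled answers agree to an intermediate degree. The `Task` type, the `agreement` field, and the band limits `low` and `high` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    question: str
    agreement: float  # fraction of Solver samples matching the most common answer


def filter_tasks(tasks: list[Task], low: float = 0.3, high: float = 0.8) -> list[Task]:
    """Keep tasks whose agreement lies in [low, high] (thresholds are assumptions).

    Very low agreement suggests an unreliable pseudo-label or an ill-posed
    question; very high agreement suggests the task is too easy to teach much.
    """
    return [t for t in tasks if low <= t.agreement <= high]


candidates = [
    Task("too easy", 1.0),
    Task("informative", 0.6),
    Task("likely noisy", 0.1),
]
kept = filter_tasks(candidates)
print([t.question for t in kept])
```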

Weaknesses

Some of the reported improvements in Table 1 appear to be statistically insignificant, weakening the empirical strength of the main claims.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. Generating novel data from scratch is a valuable research direction, and the co-evolving dual-agent framework is novel and insightful.
2. The experiments are extensive, and the empirical performance improvement appears to be large.
3. The ablation study is thorough and provides useful findings for future research in this direction.

Weaknesses

1. The baselines studied in this work are relatively weak. Other data generation approaches, such as Absolute Zero [1], are discussed but not compared empirically.
2. As the authors also mention, RLVR methods with zero-shot training objectives, such as maximizing model confidence or entropy, should also be compared against the data generation approaches, since all of them are zero-shot approaches.
[1] Zhao, A., Wu, Y., Yue, Y., Wu,

Reviewer 03·Rating 8·Confidence 4

Strengths

R-Zero's primary strength is its ability to create a self-improving loop for reasoning tasks without human-labeled alignment data. It successfully adapts the self-play paradigm to a domain that lacks a perfect external verifier (like a code executor or game engine), cleverly using a majority-vote mechanism to create a noisy but effective "pseudo-ground truth". The framework demonstrates significant and consistent performance gains across different model architectures (Qwen3, OctoThinker) and sc
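The majority-vote mechanism the reviewer describes can be sketched as follows. This is an illustrative sketch only: it assumes the Solver is sampled several times on the same question and that final answers are compared as normalized strings. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree.

    The winning answer serves as a noisy pseudo-label. Ties resolve to the
    answer that appeared first in the list.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


# Four sampled Solver answers to one question (made-up values).
label, agreement = majority_vote(["4", "4", "5", "4 "])
print(label, agreement)
```

The agreement fraction doubles as a rough confidence signal: low agreement suggests the pseudo-label may be wrong, which is consistent with the reviewer calling it noisy.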

Weaknesses

The most significant limitation is that the self-improvement process is not indefinitely stable. After a few iterations, all tested models experience a "performance collapse," where their scores on benchmarks begin to decline. Larger models are more resilient and collapse later, but the eventual degradation appears inherent to the current framework. The performance collapse is directly linked to a decline in the quality of the training data. As the Challenger generates progressively harder prob

Code & Models

Models

spartan8806/atles-champion-embedding
model· 55 dl· ♡ 1
