Can Large Reasoning Models Self-Train?

Sheikh Shafayat; Fahim Tajwar; Ruslan Salakhutdinov; Jeff Schneider; Andrea Zanette

arXiv:2505.21444·cs.LG·October 10, 2025

TL;DR

This paper investigates the potential of self-training in large reasoning models using reinforcement learning and majority voting, revealing both its benefits and critical limitations like reward hacking that hinder sustained self-improvement.

Contribution

It demonstrates that simple self-feedback mechanisms can improve reasoning performance but also identifies fundamental challenges such as reward hacking in prolonged self-training.

Findings

01 Self-training improves reasoning performance.

02 Better feedback quality enhances model learning.

03 Reward hacking causes performance collapse.
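
A toy illustration of finding 03 may help make the failure mode concrete. The sketch below is not the authors' code: the function name `pseudo_and_true_reward` and the sample answers are hypothetical, chosen only to show how a majority-vote pseudo-reward can be maximized by a policy that always emits the same answer, even when that answer is wrong.

```python
from collections import Counter


def pseudo_and_true_reward(answers, ground_truth):
    """Return (mean pseudo-reward, true accuracy) for one prompt.

    Hypothetical helper for illustration only. The pseudo-reward of a sample
    is 1 if it matches the majority answer, so its mean is the majority share.
    """
    _, majority_count = Counter(answers).most_common(1)[0]
    pseudo = majority_count / len(answers)
    true = sum(a == ground_truth for a in answers) / len(answers)
    return pseudo, true


# Made-up samples for a prompt whose correct answer is assumed to be "12".
diverse = ["12", "12", "15", "12", "9"]  # majority happens to be correct
collapsed = ["0", "0", "0", "0", "0"]    # every sample agrees on a wrong answer

print(pseudo_and_true_reward(diverse, "12"))
print(pseudo_and_true_reward(collapsed, "12"))
```

In this toy case the collapsed samples score a higher pseudo-reward than the diverse ones while their true accuracy is zero, which is the kind of gap between training reward and actual performance that the reward-hacking finding refers to.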

Abstract

Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training - the process where a model learns from its own judgments - can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. On a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of such a self-training paradigm - prolonged RL with self-reward leads to reward hacking where models learn to maximize training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight…
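
As a rough sketch of the self-feedback mechanism the abstract describes, majority voting can be turned into a per-sample reward as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `majority_vote_rewards`, the binary 0/1 reward, and the tie-breaking rule are all assumptions, and the answers are assumed to be already extracted and normalized as strings.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for the final answers sampled on one prompt.

    Hypothetical helper: the most common answer is treated as the label, and
    each sample gets reward 1.0 if it matches that label, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter orders ties by first occurrence; that tie-break is an
    # arbitrary choice made for this sketch.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Made-up final answers from five sampled solutions to the same prompt.
sampled = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(sampled)
print(label, rewards)
```

No ground-truth label appears anywhere in this computation; the rewards would then be passed to whatever RL algorithm is in use in place of a verifier's output.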

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The notion of self-rewarded training on self-generated data is quite novel. In previous methods, the most common way to do RL without a verifiable reward is to use a teacher model as a verifier or as the source for knowledge distillation. In this paper, however, the proposed method uses only a single model that both generates the answers and verifies them by majority voting.

Weaknesses

1. I think the paper should also propose an alternative to majority voting. Majority voting clearly admits a shortcut: generating identical answers across different trajectories. Therefore, to show the validity of SRT, the authors should give proper examples of self-verification that do not have shortcut solutions.

2. The authors do not explain the main reason why the performance of SRT is quite competitive with its counterpart (i.e.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The authors tested many models including Llama3.1-8b, Qwen2.5-Math-7B, Qwen3-14B, and Deepseek-Math-7B, and also conducted tests on many tasks such as Reasoning GYM and Math reasoning. Moreover, the authors honestly reported at the end that long-term training leads to model collapse.

Weaknesses

1. This paper gives me a strong sense of disconnect. The front section spends a large amount of space introducing Self-Reward Training (SRT) (i.e., using majority voting as a supervision signal), but we can see that in most scenarios, this method is far inferior to directly training on Ground Truth Labels, as shown in Figure 3 and Figure 4. Furthermore, we know from the later text that this training method leads to a collapse phenomenon where the model's accuracy rapidly declines in the later st

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. This paper conducts training on both synthetic and mathematical reasoning tasks, and covers as many RL algorithms, model types, and datasets as possible to ensure comprehensive evaluation.

2. The paper provides thorough descriptions of experimental details, offering strong reproducibility.

Weaknesses

1. The paper defines self-improvement as a model “judging the correctness of its own outputs,” whereas the standard usage typically means training on the model’s own outputs. In fact, several cited works (e.g., STaR, ReST-EM, RFT) rely on verifiable ground-truth rewards rather than purely self-judged signals. This distinction affects claims of novelty.

2. The title asks “Can large reasoning models self-train?” but the answer is effectively no, especially on harder datasets where base accurac

Taxonomy

Topics: Semantic Web and Ontologies · AI-based Problem Solving and Planning