Learning to Reason without External Rewards

Xuandong Zhao; Zhewei Kang; Aosong Feng; Sergey Levine; Dawn Song

arXiv:2505.19590·cs.LG·May 19, 2026

TL;DR

This paper introduces Intuitor, a novel reinforcement learning framework enabling large language models to learn complex reasoning tasks solely from their own confidence signals, eliminating the need for external supervision.

Contribution

It proposes a fully unsupervised learning method using self-certainty as the reward, demonstrating competitive performance and better generalization without external labels.

Findings

1. Intuitor matches the performance of RLVR on mathematical benchmarks.

2. It generalizes better to out-of-domain tasks such as code generation.

3. The approach eliminates reliance on costly external rewards or labeled data.

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals…
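The reward substitution described in the abstract can be illustrated with a minimal sketch. It assumes that self-certainty is the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution; the direction of the KL and the averaging over tokens are assumptions here, since the abstract does not spell them out. It also assumes GRPO's usual group normalization, in which each sampled response's score is standardized against the other responses to the same prompt. The function names `self_certainty` and `group_relative_advantages` and the logits input format are hypothetical, chosen for illustration only.

```python
import numpy as np


def self_certainty(logits: np.ndarray) -> float:
    """Average KL(U || p) over the tokens of one sampled response.

    logits: array of shape (num_tokens, vocab_size) holding the model's
    next-token logits at each generated position (assumed input format).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    vocab_size = logits.shape[-1]
    # KL(U || p) = sum_j (1/V) * (log(1/V) - log p_j)
    #            = -log V - mean_j(log p_j)
    kl_per_token = -np.log(vocab_size) - log_p.mean(axis=-1)
    return float(kl_per_token.mean())


def group_relative_advantages(scores, eps: float = 1e-8) -> np.ndarray:
    """Standardize the scores of one group of responses to the same prompt."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)


# Usage: three sampled responses to one prompt, with toy random logits.
rng = np.random.default_rng(0)
group = [rng.normal(scale=s, size=(5, 8)) for s in (0.1, 1.0, 3.0)]
scores = [self_certainty(logits) for logits in group]
advantages = group_relative_advantages(scores)
```

In this sketch a response whose token distributions are close to uniform scores near zero, while a sharply peaked one scores higher, so the advantages favor the responses the model is most confident about within each group. No gold answer or test case enters the computation.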

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Originality: This paper is part of a wave of recent work attempting self-improvement for LLMs. These methods share a similar flavor, focusing on math/code tasks and using some type of heuristic, in this case a KL divergence metric. The specific approach appears to be novel, though I am not confident about it.
Quality: The idea is clear and the experimental analysis is quite thorough.
Clarity: The writing is quite clear.
Significance: The analysis is good. The insight regarding online versus o

Weaknesses

The main contribution of this work is the use of KL divergence against a uniform distribution as a proxy reward. I don't think there is enough theoretical justification or empirical evidence showing why this particular proxy reward is better than alternatives. The only comparison provided is with plurality voting, yet several other alternatives exist (missing references are linked below). The analysis of different model behaviors is interesting, but it is unclear whether these behaviors are due

Reviewer 02 · Rating 8 · Confidence 2

Strengths

+ Proposes a new RL paradigm (RLIF) that removes the dependency on external or verifiable rewards. It is an elegant and forward-looking idea for autonomous reasoning systems.
+ Using self-certainty as an intrinsic reward is well-motivated, mathematically grounded, and can be integrated into existing policy optimization frameworks.
+ Demonstrates solid performance across both reasoning and code tasks, with consistent improvements in generalization, instruction-following, and early learning spee

Weaknesses

I have three main concerns about this paper:
* The authors should argue much more convincingly that self-certainty is a stable way of guiding the learning process and won't lead agents to overconfidently learn shortcut behaviors. A random baseline that assigns "random confidence" would help a lot in establishing the value of self-confidence. Providing second-order metrics would also help.
* In the experimental evaluation section, providing a list of scientific questions at the beginning of

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper extends RLIF by integrating self-certainty into GRPO for process-aware rewards using online computation to curb hacking, differing from outcome-focused RLVR.
2. The paper presents comprehensive experiments across families, with ablations rigorously contrasting entropy/random baselines to affirm stability.
3. The paper addresses a relevant question: scalable rewards without supervision for RLVR-limited domains, with execution showing INTUITOR's 13.8% OOD gains (e.g., LiveCodeBench)

Weaknesses

1. The approach replaces RLVR with a self-certainty-based reward; however, it is questionable whether the gains still hold when the models hallucinate, especially when the method is scaled to larger models.
2. The paper's experiments are confined to small models (1.5B-14B) and corpora (7.5k problems), raising doubts about scalability.
3. The novelty is modest: the self-certainty reward builds directly on the self-certainty idea proposed in Kang et al. (2025) to replace RLVR in the GRPO formulation from DeepS

Code & Models

Repositories

sunblaze-ucb/Intuitor · github

Videos

Learning to Reason without External Rewards · slideslive

Taxonomy

Topics: Epistemology, Ethics, and Metaphysics