Maximizing Confidence Alone Improves Reasoning

Mihir Prabhudesai; Lili Chen; Alex Ippoliti; Katerina Fragkiadaki; Hao Liu; Deepak Pathak

arXiv:2505.22660·cs.LG·June 30, 2025

TL;DR

This paper introduces RENT, an unsupervised reinforcement learning method that enhances reasoning in language models by maximizing the model's confidence through entropy minimization, without requiring external rewards.

Contribution

RENT is a novel unsupervised RL approach that improves reasoning by reinforcing high-confidence chains of thought based solely on the model's entropy, eliminating the need for external rewards.

Findings

01

Significant improvements on reasoning benchmarks like GSM8K and MATH500.

02

Effective across various model sizes and architectures.

03

Demonstrates broad applicability without external supervision.

Abstract

Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including…
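
The idea of using entropy as an intrinsic reward can be illustrated with a minimal sketch. It assumes the reward for a sampled response is the negative mean Shannon entropy of the model's next-token distributions over that response; the abstract does not spell out the exact aggregation, so this choice, the function names `token_entropy` and `entropy_reward`, and the toy distributions are all illustrative assumptions rather than the authors' implementation.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability entries contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_reward(token_distributions):
    """Hypothetical intrinsic reward: negative mean per-token entropy.

    `token_distributions` holds one probability vector per generated
    token. Lower entropy (higher confidence) gives a higher reward.
    Averaging over tokens is an assumption made for this sketch.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [token_entropy(p) for p in token_distributions]
    return -sum(entropies) / len(entropies)


# Toy distributions over a 4-token vocabulary (made up for illustration).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

# The confident response receives the larger (less negative) reward.
print(entropy_reward(confident) > entropy_reward(uncertain))
```

In an RL loop, a scalar like this would take the place of a correctness reward in the policy update, so no ground-truth answer is needed to score a response.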

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The proposed method requires no ground-truth answers for training LLMs with RL.
- The method is simple, clear, and effective.

Weaknesses

- The stated motivation is that verifiable rewards can be unavailable for open-ended tasks; however, all the test benchmarks in this paper have verifiable answers. It is unclear how the experiments support this motivation.
- The performance gap between standard RL and the proposed method is not provided.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

(1) Originality: The paper is among the first to explore intrinsic rewards for reasoning, introducing the novel RENT framework that uses entropy minimization as an unsupervised reinforcement signal, removing the need for external supervision.
(2) Quality: The experiments are thorough and well-executed across multiple benchmarks (GSM8K, MATH500, AMC, AIME, GPQA) and model families, with clear analyses showing that entropy (confidence) correlates strongly with reasoning accuracy and that RENT out

Weaknesses

(1) My main concern lies in the degree of novelty. While the paper is among the first to explore intrinsic rewards for reasoning, several concurrent works (e.g., [1,2]) have proposed similar ideas, and others ([3,4]) even adopt nearly identical entropy-based formulations. This overlap raises questions about whether the contribution remains sufficiently distinct and timely for acceptance.
(2) Figures 2–4 are somewhat blurry and low in resolution, making them difficult to read. Additionally, the

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- S1. The idea of minimizing the entropy of token distributions is smart, though it has been explored in previous works (see W1).
- S2. The proposed approach is architecture-agnostic and label-free, so it can be easily applied in real-world or open-ended scenarios where external supervision is not available.
- S3. The paper provides insightful token-level analyses, revealing that reducing entropy near the end of the reasoning process correlates most strongly with accuracy.

Weaknesses

- W1. Limited technical novelty and lack of baselines. First, the idea of maximizing confidence to improve reasoning capability is not new and has been explored in previous works [1,2,3], many of which are not discussed in the paper. Second, the experiments only consider other unsupervised methods as baselines, but not standard RLVR (e.g., standard GRPO with correctness reward). Showing how close RENT comes to standard RLVR (as a fraction of the performance gap) would help position its empi

Code & Models

Models

🤗
aippolit/RENT-Qwen-7B
model · 6 dl · ♡ 1

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)