Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani; Isha Puri; Stewart Slocum; Idan Shenfeld; Leshem Choshen; Yoon Kim; Jacob Andreas

arXiv:2507.16806·cs.LG·May 18, 2026

TL;DR

This paper introduces RLCR, a reinforcement learning approach that jointly improves the accuracy and calibration of language models' confidence estimates during reasoning tasks.

Contribution

RLCR is a novel training method that incorporates calibration rewards to produce more reliable and well-calibrated language models without sacrificing accuracy.

Findings

1. RLCR improves calibration across diverse datasets.
2. RLCR maintains accuracy while enhancing confidence calibration.
3. Verbalized confidence can be used to further improve model reliability.
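
The findings above refer to calibration without saying how it is measured on this page. One common metric is expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. The sketch below is illustrative only: the function name, the equal-width binning, and the default of 10 bins are assumptions, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to a bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, 1.0 if y else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that states confidence 1.0 on every answer but is right only half the time gets an ECE of 0.5 under this definition, while a model whose stated confidence matches its accuracy in every bin gets 0.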

Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to…
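
The abstract is cut off before the reward is stated. Going by Reviewer 02's description of the method (a Brier-score term added to the RL objective), one plausible form can be sketched as follows. The function names and the exact combination, correctness minus the squared gap between confidence and correctness, are assumptions made for illustration and are not taken from this page.

```python
def rlcr_style_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty on the stated confidence.

    Assumed form: r = y - (q - y)^2, with y in {0, 1} and q in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (
        p_correct * rlcr_style_reward(True, confidence)
        + (1.0 - p_correct) * rlcr_style_reward(False, confidence)
    )


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid search for the confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under this sketch a confidently wrong answer (confidence 1.0) scores -1.0, whereas a wrong answer reported with confidence 0.0 scores 0.0; a purely binary reward would give 0 in both cases. Because the Brier score is a proper scoring rule, the expected reward for a fixed answer is maximized by reporting the true probability of being correct, which `best_confidence` checks numerically.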

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

- The paper tackles a timely and practically relevant problem, supported by a fair number of experiments.
- Overall, the paper is clearly written and easy to follow.

Weaknesses

- **Model coverage** The paper only tests Qwen2.5-7B; additional models, perhaps including a non-reasoning model, should be included to demonstrate that RLCR generalizes beyond a single model. I must state that this is not a mere superficial comment but rather one of my major concerns.
- **Fair evaluation** Page 27 notes that RLCR produces shorter `<think>` sections than RLVR. It is unclear whether the results were evaluated under equal inference budgets, and this should be explicitly specified to ensure fa…

Reviewer 02·Rating 6·Confidence 3

Strengths

- The paper tackles a well-known yet underexplored issue in RL-based reasoning: overconfidence induced by correctness-only rewards. This concern has also been raised in recent works such as *“Why Language Models Hallucinate” (OpenAI, 2025)*, but practical algorithmic remedies have been lacking.
- The proposed modification is extremely intuitive (adding a proper scoring rule, the Brier score, as a term in the RL objective) and is supported by clean theoretical analysis showing that it jointly optimizes acc…

Weaknesses

1. **(Minor) Inconsistency in the definition of proper scoring rules.**
   - Equation (4) defines a proper scoring rule as one whose expected value is minimized.
   - However, the examples mix loss and reward conventions:
     - The Brier score (Eq. 6) is defined as a loss (minimized).
     - The logarithmic (Eq. 5) and spherical (Eq. 7) scores are utility functions (maximized).
   - The statement that “all these scores... are maximized” (line 144) contradicts both Eq. (4) and the Brier…

Reviewer 03·Rating 6·Confidence 4

Strengths

* The method is simple and intuitive, and shows potential and effectiveness in improving both task accuracy and calibration.
* The reward design and the choice of scoring rule are well-motivated and theoretically justified.
* Uncertainty-aware reasoning and RL training for calibration are a timely and relatively under-explored topic.

Weaknesses

- The paper should state more clearly its contributions and benefits compared with the previous RL-based calibration methods cited in the related work [1,2,3]. Given the overlap in scope, it would be very useful to include these as baselines, which are potentially stronger than the RLVR variants. Also, given the recent debate on the contamination issue of Qwen-family models in RLVR, it would be helpful to test on other models as well.
- It would be good to isolate / disentangle the role of the Brier term and uncerta…

Code & Models

Repositories

https://rl-calibration.github.io
github

Datasets

mehuldamani/hotpot_qa
dataset· 1.2k dl

Videos

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty· slideslive

Taxonomy

Topics·Intelligent Tutoring Systems and Adaptive Learning