Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

Qianxi He; Qingyu Ren; Shanzhe Lei; Xuhong Wang; Yingchun Wang

arXiv:2511.07483·cs.AI·November 12, 2025

TL;DR

This paper introduces a confidence-aware reward model that improves reasoning quality in large language models by penalizing low-confidence correct answers, leading to more consistent and logical STEM reasoning.
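
To make the idea of penalizing low-confidence correct answers concrete, here is a minimal sketch contrasting a plain rule-based judge with a confidence-aware variant. It is an illustration under stated assumptions, not the paper's formulation: the function names, the `threshold` and `penalty` parameters, and the idea that a confidence score in [0, 1] is already available are all hypothetical.

```python
def rule_based_reward(answer: str, reference: str) -> float:
    """Plain rule-based judge: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def confidence_aware_reward(
    answer: str,
    reference: str,
    confidence: float,
    threshold: float = 0.5,  # hypothetical cutoff for "low confidence"
    penalty: float = 0.5,    # hypothetical deduction, not taken from the paper
) -> float:
    """Reduce the reward for answers that are correct but low-confidence.

    `confidence` is assumed to be a score in [0, 1] supplied by some
    external estimator; how the paper derives it is not shown here.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    base = rule_based_reward(answer, reference)
    if base == 1.0 and confidence < threshold:
        # Correct, but possibly a lucky guess: withhold part of the reward.
        return base - penalty
    return base


if __name__ == "__main__":
    print(rule_based_reward("42", "42"))             # the rule-based judge ignores confidence
    print(confidence_aware_reward("42", "42", 0.9))  # confident and correct
    print(confidence_aware_reward("42", "42", 0.2))  # correct but low confidence
    print(confidence_aware_reward("41", "42", 0.9))  # incorrect
```

In this toy version the rule-based judge pays the same reward for a lucky guess as for a well-supported answer, while the confidence-aware variant separates the two cases.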

Contribution

The paper presents a novel confidence-based reward model for LLM reasoning. It targets a limitation of purely rule-based rewards, which can reward a correct final answer even when it was reached through a low-quality reasoning chain.

Findings

01. Outperforms state-of-the-art reward models on STEM benchmarks

02. Promotes more robust and logically consistent reasoning

03. Effective in both static and RL training evaluations

Abstract

Recent advances in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, numerous technical reports indicate that RL with purely rule-based rewards frequently produces poor-quality reasoning chains or inconsistencies between the reasoning process and the final answer, particularly when the base model is small. During RL exploration, a model that lacks the necessary knowledge may follow a low-quality reasoning chain, occasionally reach the correct answer by chance, and still be rewarded by the rule-based judge. This limits the ability of resource-constrained organizations to apply RL training directly to smaller models. We propose a novel confidence-based reward model…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)