How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, and Mohammad M. Ghassemi

arXiv:2601.08134 · cs.CL · January 22, 2026

TL;DR

This paper introduces RMCB, a comprehensive benchmark for evaluating confidence estimators in large reasoning models across high-stakes domains, revealing trade-offs and limitations of current methods.

Contribution

The paper provides the first large-scale, diverse benchmark for confidence estimation in LRMs and systematically evaluates multiple methods, highlighting their strengths and weaknesses.

Findings

01. Text-based encoders have higher discrimination ability.

02. Structurally-aware models have better calibration.

03. Increasing architectural complexity does not improve performance.
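
The first two findings contrast discrimination with calibration, and the two can diverge. As a rough illustration, the sketch below computes one common metric for each: AUROC for discrimination and expected calibration error (ECE) for calibration. This summary does not state which metrics or binning the paper uses, so the metric choice, the function names, the 10-bin setting, and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], labels: Sequence[int]) -> float:
    """Discrimination: chance that a correct trace outscores an incorrect one.

    Ties count as half a win. labels are 1 (correct) or 0 (incorrect).
    """
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    confidences: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Calibration: weighted gap between mean confidence and accuracy per bin.

    Uses equal-width bins over [0, 1]; n_bins=10 is an assumed default.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical): two correct traces followed by two incorrect ones.
labels = [1, 1, 0, 0]
ranked_but_overconfident = [0.99, 0.98, 0.97, 0.96]  # perfect ordering, inflated scores
calibrated_but_flat = [0.5, 0.5, 0.5, 0.5]           # honest on average, no ordering

for name, scores in [
    ("ranked_but_overconfident", ranked_but_overconfident),
    ("calibrated_but_flat", calibrated_but_flat),
]:
    print(name, auroc(scores, labels), expected_calibration_error(scores, labels))
```

In this toy case the first estimator ranks every correct trace above every incorrect one yet reports confidences far above the 50% accuracy, while the second matches accuracy exactly but cannot separate the two groups. An estimator can therefore score well on one axis and poorly on the other.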

Abstract

The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between…

Taxonomy

Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education