How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, and Mohammad M. Ghassemi

TL;DR
This paper introduces RMCB, a comprehensive benchmark for evaluating confidence estimators in large reasoning models across high-stakes domains, revealing trade-offs and limitations of current methods.
Contribution
It provides the first large-scale, diverse benchmark for confidence estimation in LRMs and systematically evaluates multiple methods, highlighting their strengths and weaknesses.
Findings
Text-based encoders have higher discrimination ability.
Structurally-aware models have better calibration.
Increasing architectural complexity does not improve performance.
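The first two findings separate discrimination (ranking correct traces above incorrect ones) from calibration (confidence values matching observed accuracy). The sketch below is only an illustration of how the two can diverge. It assumes AUROC as the discrimination measure and expected calibration error (ECE) as the calibration measure, which are common choices but are not confirmed by this summary as the paper's exact metrics. The function names and toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Chance that a random correct sample outranks a random incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(scores, labels, n_bins=10):
    """Sample-weighted mean gap between average confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        bins[idx].append((s, y))
    total = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): 1 = correct trace, 0 = incorrect trace.
labels = [1, 1, 0, 0]
ranker_scores = [0.99, 0.98, 0.97, 0.96]   # perfect ordering, uniformly overconfident
flat_scores = [0.50, 0.50, 0.50, 0.50]     # matches the 50% accuracy, no ordering at all

ranker_auroc = auroc(ranker_scores, labels)
ranker_ece = expected_calibration_error(ranker_scores, labels)
flat_auroc = auroc(flat_scores, labels)
flat_ece = expected_calibration_error(flat_scores, labels)
```

In this toy case the first estimator ranks perfectly but is badly calibrated, while the second is well calibrated but cannot rank at all, which is why the two properties are reported separately.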
Abstract
The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between…
Taxonomy
Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
