How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del; Markus Kängsepp; Marharyta Domnich; Ardi Tampuu; Lisa Yankovskaya; Meelis Kull; Mark Fishel

arXiv:2603.19118·cs.AI·March 20, 2026

TL;DR

This paper investigates how two black-box uncertainty signals, self-consistency and verbalized confidence, scale with the number of parallel samples in reasoning models across mathematics, STEM, and humanities tasks. It finds that combining the two signals improves uncertainty estimates more than scaling either one alone, and that these effects depend on the domain.

Contribution

The paper characterizes how uncertainty signals scale with sampling across three reasoning models and 17 tasks, highlighting the benefits of hybrid estimators and domain-specific behaviors.

Findings

01

Self-consistency and verbalized confidence scale with sampling.

02

With just two samples, a hybrid estimator improves AUROC by up to 12 points on average.

03

The mathematics domain shows stronger uncertainty signals and faster scaling.

Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics,…
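To make the two signals and their combination concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how verbalized confidences are aggregated or how the hybrid estimator weights the signals, so the mean over agreeing samples and the equal-weight average below (and all function names) are hypothetical choices. AUROC is computed as the probability that a correctly answered question receives a higher score than an incorrectly answered one.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority answer, fraction of samples agreeing with it)."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def mean_verbalized_confidence(answers, confidences, target):
    """Average stated confidence over samples that gave the target answer.

    Assumption: confidences are already parsed into floats in [0, 1].
    """
    vals = [c for a, c in zip(answers, confidences) if a == target]
    return sum(vals) / len(vals)


def hybrid_score(answers, confidences, weight=0.5):
    """Weighted average of the two signals (hypothetical combination rule)."""
    top, sc = self_consistency(answers)
    vc = mean_verbalized_confidence(answers, confidences, top)
    return top, weight * sc + (1 - weight) * vc


def auroc(scores, labels):
    """Pairwise AUROC: P(score of correct > score of incorrect), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three sampled answers to one question, each with a stated confidence.
    answer, score = hybrid_score(["42", "42", "17"], [0.9, 0.8, 0.6])
    print(answer, round(score, 3))
```

In an evaluation, one such score per question would be paired with a correctness label for the majority answer and passed to `auroc` to measure how well the score separates right from wrong answers.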

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications