Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

Christian Hobelsberger; Theresa Winner; Andreas Nawroth; Oliver Mitevski; Anna-Carolina Haensch

arXiv:2510.20460 · cs.CL · October 24, 2025

TL;DR

This paper systematically evaluates four uncertainty estimation methods in large language models across multiple question-answering tasks, finding that the hybrid CoCoA approach offers the best overall reliability and calibration.

Contribution

The paper provides a comprehensive comparison of four confidence estimation methods for LLMs and offers insights into their relative effectiveness and trade-offs.

Findings

01

CoCoA outperforms individual metrics in reliability

02

Different metrics capture distinct confidence facets

03

Hybrid approach improves calibration and discrimination

Abstract

Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, which makes their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches to confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). We evaluate these approaches in experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
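
To make the terms "calibration" and "discrimination" concrete, the following is a minimal sketch of how confidence scores of the kind discussed in the abstract could be computed and scored. It is not the paper's implementation. It assumes that a sequence-probability confidence is the exponentiated sum of token log-probabilities, that sample consistency is the share of sampled answers matching the most frequent one, that calibration is measured with expected calibration error (ECE), and that discrimination is measured with AUROC. All function names are hypothetical, and CoCoA itself is not reproduced here.

```python
import math
from collections import Counter


def sequence_probability(token_logprobs):
    """Assumed MSP-style score: probability of the whole generated sequence."""
    return math.exp(sum(token_logprobs))


def sample_consistency(samples):
    """Assumed consistency score: share of samples equal to the modal answer.

    Answers are compared after trimming whitespace and lowercasing, which is
    a simplification; free-form answers usually need a semantic comparison.
    """
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean gap between confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """AUROC: chance that a correct answer outranks an incorrect one."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    # Ties between a correct and an incorrect answer count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch a lower ECE indicates better calibration and a higher AUROC indicates better discrimination, so a method can do well on one and poorly on the other, which is consistent with the abstract's point that the metrics capture different facets of confidence.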

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification