Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Christian Hobelsberger, Theresa Winner, Andreas Nawroth, Oliver Mitevski, Anna-Carolina Haensch

TL;DR
This paper systematically evaluates four uncertainty estimation methods in large language models across multiple question-answering tasks, finding that the hybrid CoCoA approach offers the best overall reliability and calibration.
Contribution
It provides a comprehensive comparison of four confidence estimation methods for LLMs and offers insights into their relative effectiveness and trade-offs.
Findings
CoCoA outperforms individual metrics in reliability
Different metrics capture distinct confidence facets
Hybrid approach improves calibration and discrimination
Abstract
Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). To evaluate these approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
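To make the difference between the score types concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes MSP is computed as the probability of the generated sequence from its token log-probabilities, that Sample Consistency is the mean similarity between the main answer and additional sampled answers, and that the hybrid score multiplies the two resulting uncertainties in the spirit of CoCoA; the exact formulation in Vashurin et al. (2025) may differ. All function names and the exact-match similarity are hypothetical choices for illustration. VCE is not sketched because it depends on prompting the model to state its own confidence.

```python
import math
from typing import Callable, Sequence


def msp_confidence(token_logprobs: Sequence[float]) -> float:
    """Sequence probability: exp of the summed token log-probabilities."""
    return math.exp(sum(token_logprobs))


def exact_match(a: str, b: str) -> float:
    """Toy similarity (assumption): 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def sample_consistency(
    answer: str,
    samples: Sequence[str],
    similarity: Callable[[str, str], float] = exact_match,
) -> float:
    """Mean similarity between the main answer and independently sampled answers."""
    if not samples:
        raise ValueError("at least one sampled answer is required")
    return sum(similarity(answer, s) for s in samples) / len(samples)


def hybrid_confidence(
    token_logprobs: Sequence[float],
    answer: str,
    samples: Sequence[str],
    similarity: Callable[[str, str], float] = exact_match,
) -> float:
    """Multiplicative hybrid (assumed form): combine both uncertainties, return confidence."""
    u_conf = 1.0 - msp_confidence(token_logprobs)  # information-based uncertainty
    u_cons = 1.0 - sample_consistency(answer, samples, similarity)  # disagreement
    return 1.0 - u_conf * u_cons


if __name__ == "__main__":
    logprobs = [math.log(0.5), math.log(0.5)]
    sampled = ["paris", "Paris ", "Lyon", "Paris"]
    print(msp_confidence(logprobs))
    print(sample_consistency("Paris", sampled))
    print(hybrid_confidence(logprobs, "Paris", sampled))
```

In this multiplicative form the hybrid uncertainty is high only when the model assigns low probability to its answer and the sampled answers also disagree with it, which is one way to read the claim that the individual metrics capture different facets of confidence.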
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
