Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Kevin Wang; Subre Abdoul Moktar; Jia Li; Kangshuo Li; Feng Chen

arXiv:2511.03166·cs.CL·November 6, 2025

Open Access

TL;DR

This paper conducts an extensive empirical evaluation of various uncertainty estimation methods in Large Language Models, analyzing their effectiveness in in-distribution and out-of-distribution question-answering tasks to enhance trustworthiness.

Contribution

It provides a comprehensive comparison of twelve uncertainty estimation techniques across multiple metrics and datasets, highlighting their strengths and limitations in different contexts.

Findings

01. Information-based methods excel in ID settings.

02. Density-based methods perform better in OOD scenarios.

03. Semantic consistency methods show reliable performance across datasets.

Abstract

Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics including LLMScore from LLM criticizers to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data.…
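
To make "information-based methods, which leverage token and sequence probabilities" concrete, here is a minimal sketch of three common scores of that kind: sequence negative log-likelihood, perplexity, and mean token entropy. It is an illustration under stated assumptions, not the paper's implementation. The function names and input formats are hypothetical, and the abstract does not say which specific information-based measures the study uses.

```python
import math


def sequence_uncertainty(token_logprobs):
    """Sequence-level scores from per-token log-probabilities (hypothetical helper).

    token_logprobs: natural-log probabilities the model assigned to each
    token it actually generated. Higher scores mean more uncertainty.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = len(token_logprobs)
    nll = -sum(token_logprobs)       # negative log-likelihood of the whole answer
    mean_nll = nll / n               # length-normalized, so long answers are not penalized
    return {
        "nll": nll,
        "mean_nll": mean_nll,
        "perplexity": math.exp(mean_nll),
        "max_token_surprisal": max(-lp for lp in token_logprobs),
    }


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) of the per-step next-token distributions.

    token_distributions: one probability list per generated token, each
    summing to 1 (an assumed input format; real APIs often expose only top-k).
    """
    if not token_distributions:
        raise ValueError("need at least one distribution")
    entropies = []
    for dist in token_distributions:
        # Terms with p == 0 contribute nothing to the entropy.
        entropies.append(-sum(p * math.log(p) for p in dist if p > 0.0))
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    scores = sequence_uncertainty([math.log(0.5)] * 4)
    print(scores["perplexity"])
    print(mean_token_entropy([[0.5, 0.5], [1.0, 0.0]]))
```

Scores like these depend only on the model's own output probabilities, which is consistent with the abstract's observation that they work best when the input resembles data the model already understands.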

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Computational and Text Analysis Methods