MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty
Yongjin Yang, Haneul Yoo, Hwaran Lee

TL;DR
This paper introduces MAQA, a multi-answer question answering dataset for evaluating uncertainty quantification in LLMs under data uncertainty. It assesses existing methods on this dataset, finds that their effectiveness varies across tasks, and highlights entropy- and consistency-based approaches.
Contribution
The paper presents MAQA, a novel dataset for testing uncertainty quantification on questions that admit multiple valid answers, and evaluates five methods, providing insights into their performance under data uncertainty.
Findings
Existing methods perform worse under data uncertainty than in single-answer settings.
Entropy- and consistency-based methods effectively estimate uncertainty.
Performance varies depending on the task and method used.
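
To make the entropy- and consistency-based idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes the model's answers have already been sampled several times for the same question, treats two answers as identical when they match after lowercasing and whitespace normalization, and uses hypothetical helper names (`normalize`, `answer_entropy`, `consistency`).

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    # Assumption: answers are equivalent if they match after
    # lowercasing and collapsing whitespace.
    return " ".join(answer.lower().split())


def answer_entropy(samples: list[str]) -> float:
    # Shannon entropy (in nats) of the empirical distribution over
    # distinct normalized answers. Higher means more spread out.
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def consistency(samples: list[str]) -> float:
    # Fraction of samples that agree with the most frequent answer.
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


# Hypothetical sampled answers to one question.
samples = ["Paris", "paris", "Paris ", "Lyon"]
print(answer_entropy(samples), consistency(samples))
```

In a multi-answer setting, spread across sampled answers can reflect several valid answers rather than model error, which is the confound between data uncertainty and model uncertainty that the paper examines.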
Abstract
Despite the massive advancements in large language models (LLMs), they still suffer from producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, which can arise from factors like multiple possible answers. This limitation may cause uncertainty quantification results to be unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and…
Taxonomy
Topics: Scientific Measurement and Uncertainty Evaluation · Scientific Computing and Data Management · Fault Detection and Control Systems
