Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering
Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach

TL;DR
This paper introduces a comprehensive benchmark to evaluate uncertainty quantification methods in large language models for scientific question answering, revealing significant limitations in current approaches and measurement practices.
Contribution
It provides the first large-scale, open-source framework for assessing UQ calibration in reasoning-intensive scientific QA across multiple models and datasets.
Findings
Instruction tuning causes probability mass polarization, reducing token-level confidence reliability.
Verbalized UQ approaches are biased and poorly correlated with correctness.
Answer frequency, meaning consistency of the answer across sampled responses, gives the most reliably calibrated confidence estimate among the methods tested.
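To make the answer-frequency finding concrete, the following minimal Python sketch scores a question by how often the majority answer appears among repeated samples, then measures calibration with an equal-width-bin expected calibration error (ECE). This is an illustration under assumptions, not the paper's implementation: the function names, the string normalization, and the choice of ECE as the calibration metric are all hypothetical, since this summary does not specify them.

```python
from collections import Counter


def answer_frequency_confidence(samples):
    """Return (majority answer, fraction of samples that agree with it).

    Assumption: answers are compared after trimming whitespace and
    lowercasing; a real pipeline would need task-specific normalization.
    """
    counts = Counter(s.strip().lower() for s in samples)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

For example, `answer_frequency_confidence(["42", "42 ", "41", "42"])` returns `("42", 0.75)`: three of the four sampled answers agree after normalization. A well-calibrated method would be correct on roughly 75% of the questions it scores at 0.75, which is the gap the ECE function measures.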
Abstract
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems
