Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

Philip Müller; Nicholas Popović; Michael Färber; Peter Steinbach

arXiv:2602.00279·cs.CL·February 3, 2026

TL;DR

This paper introduces a comprehensive benchmark to evaluate uncertainty quantification methods in large language models for scientific question answering, revealing significant limitations in current approaches and measurement practices.

Contribution

It provides the first large-scale, open-source framework for assessing UQ calibration in reasoning-intensive scientific QA across multiple models and datasets.

Findings

01

Instruction tuning causes probability mass polarization, reducing token-level confidence reliability.

02

Verbalized UQ approaches are biased and poorly correlated with correctness.

03

Answer frequency yields the most reliable calibration among the UQ methods tested.
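
To make the third finding concrete, here is a minimal sketch of an answer-frequency confidence score, paired with a common calibration measure. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the sampled answers are assumed to be already normalized to comparable strings, and expected calibration error (ECE) with equal-width bins is used only because it is a widely used calibration metric; this summary does not say which metrics the benchmark reports.

```python
from collections import Counter


def answer_frequency_confidence(samples):
    """Return the most frequent answer and the fraction of samples that agree with it.

    Assumption: `samples` holds answers to one question, already normalized
    so that equivalent answers compare equal as strings.
    """
    counts = Counter(samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Assumption: a generic calibration metric chosen for illustration.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


# Usage: four sampled answers to one question, three of which agree.
answer, confidence = answer_frequency_confidence(["4", "4", "5", "4"])
```

A low ECE means that, within each confidence bin, the stated confidence is close to the observed accuracy. A method that is always fully confident but right only half the time would score 0.5 under this definition.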

Abstract

Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not in science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain that relies on both fact retrieval and reasoning. We introduce the first large-scale benchmark for evaluating the calibration of UQ methods in reasoning-demanding QA, and provide an extensible open-source framework for reproducibly assessing calibration. Our study spans up to 20 large language models, covering base, instruction-tuned, and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems