Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models
Kyle Moore, Jesse Roberts, Daryl Watson

TL;DR
This paper evaluates how well inference-time uncertainty measures in large language models align with human uncertainty and traditional calibration metrics, highlighting measures that reflect human-like uncertainty and model correctness.
Contribution
It introduces novel evaluation methods for inference-time uncertainty and assesses their alignment with human uncertainty and model calibration.
Findings
Several uncertainty measures strongly align with human uncertainty.
Some measures demonstrate moderate to strong model calibration.
Alignment with human uncertainty does not necessarily imply preference alignment.
Abstract
There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of…
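To make the quantities named in the abstract concrete, here is a minimal sketch of two generic inference-time uncertainty measures over a multiple-choice answer distribution (Shannon entropy and top-p set size), plus a standard binned expected calibration error. This is an illustration under stated assumptions, not the authors' implementation: the function names, the default `p=0.9`, and the 10-bin equal-width scheme are all hypothetical choices, and the paper's own measures and variations may differ.

```python
import math


def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def top_p_set_size(probs, p=0.9):
    """Number of highest-probability options needed to cover mass p.

    A larger set suggests more uncertainty. The threshold p=0.9 is an
    assumption for illustration, not a value taken from the paper.
    """
    total = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        total += q
        # Small tolerance guards against floating-point rounding.
        if total >= p - 1e-12:
            return k
    return len(probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Uses equal-width bins on [0, 1]; the bin count is an assumed default.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical model probabilities over four answer options (A-D).
    model_probs = [0.7, 0.2, 0.05, 0.05]
    print(entropy_bits(model_probs))
    print(top_p_set_size(model_probs))
    print(expected_calibration_error([0.9, 0.9, 0.6], [True, False, True]))
```

Alignment with human uncertainty would then be assessed by comparing such per-question scores against a human-derived uncertainty score (for example, by correlation), while calibration compares them against model correctness.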
Peer Reviews
Decision · Submitted to ICLR 2026
- S1. Interesting angle on uncertainty quantification (UQ) research in LLMs, exploring whether the miscalibration of LLMs implies that models are in fact reflecting human uncertainty over answers.
- S2. Overall well-written and organized. Figure 3 provides a summary view over the different MMLU subjects.
- S3. Main results are backed by hypothesis testing (Figure 2 and Table 2).
- W1. **Limited novelty**: while the paper expands on the differences between prior work (in lines 82-88), it appears incremental (increasing number of datapoints and carrying calibration analysis). Perhaps the authors can highlight differences in findings or how their extended analysis to calibration differs from findings in prior work.
- W2. The definition of “inference-time” is too broad and not sufficiently motivated: the arguments provided to narrow the experiments to logit-based approaches
- **Originality.** This paper explicitly targets **human-aligned** uncertainty (beyond correctness calibration) and connects **top-p** selection to Bayesian highest-density sets; it also proposes **JSD shift** as a distributional calibration diagnostic.
- **Quality.** Careful separation of **alignment** vs **calibration**; broad metric family; multi-model evaluation; subject-wise analyses; multiple complementary criteria (correlation, ECE, JSD shift).
- **Clarity.** Clear prompt template, da
- **Multiple-choice scope.** Evidence is limited to MCQ; the open-ended extension is conceptual and untested.
- **Prompt/decoding sensitivity.** Alignment differs from prior “counterfactual prompting” studies; results may depend on the **cloze** template and decoding choices.
- **Model coverage.** Only ≤8B open-weight models are included; larger/API models are referenced but not comprehensively stress-tested.
- **Metric/threshold choices.** Heuristic thresholds (e.g., |r|≥0.3) and standard
The core problem of evaluating whether an LLM's uncertainty corresponds to human uncertainty is important for building more transparent and trustworthy AI systems. The methodology is sound and described clearly. The creation and use of a large-scale dataset from the Roper Center is a significant contribution. The finding that uncertainty alignment can exist independently of preference alignment is a particularly interesting and non-obvious result.
The study's primary limitations are:
1. The models used are mainly small, open-source, non-reasoning models. Experiments on larger and more diverse models, including closed-source SOTA models such as GPT-4, Gemini, and Claude, would strengthen the claims, as uncertainty measures are more relevant in widely deployed SOTA models.
2. The method mainly focuses on multiple-choice questions. While this is a necessary simplification, the true test of these uncertainty measures will be in open-ended genera
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
