Calibrating Expressions of Certainty
Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M. Wells, Tina Kapur, Polina Golland

TL;DR
This paper introduces a distribution-based approach to calibrating linguistic expressions of certainty, and uses it to measure and improve how well the confidence expressed by humans and models is calibrated.
Contribution
The paper models certainty phrases as distributions over the simplex, generalizes existing miscalibration measures to this representation, and introduces a post-hoc calibration method that yields interpretable suggestions.
Findings
Improved calibration of human and model certainty expressions
Distributional representation captures semantics more accurately
Provides actionable suggestions for calibration enhancement
Abstract
We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.
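To make the idea of distribution-valued confidence concrete, here is a minimal sketch for the binary case, where the simplex reduces to the interval [0, 1]. It assumes each phrase is a Beta distribution and uses a soft-binned expected calibration error in which every example spreads its weight across bins according to its phrase's distribution. The Beta parameters, the names `PHRASES`, `beta_bin_masses`, and `soft_ece`, and the exact form of the estimator are illustrative assumptions, not the paper's definitions.

```python
import math

# Hypothetical phrase -> Beta(a, b) parameters (illustrative values only).
PHRASES = {"Unlikely": (2.0, 8.0), "Maybe": (5.0, 5.0), "Likely": (8.0, 2.0)}


def beta_bin_masses(a, b, n_bins=10, grid=2000):
    """Per-bin probability mass and mean confidence of Beta(a, b), by midpoint rule."""
    masses = [0.0] * n_bins
    moments = [0.0] * n_bins
    for k in range(grid):
        p = (k + 0.5) / grid
        w = math.pow(p, a - 1) * math.pow(1 - p, b - 1)  # unnormalised density
        m = min(int(p * n_bins), n_bins - 1)
        masses[m] += w
        moments[m] += w * p
    total = sum(masses)
    means = [mo / ma if ma > 0 else 0.0 for mo, ma in zip(moments, masses)]
    return [ma / total for ma in masses], means


def soft_ece(phrases_used, labels, n_bins=10):
    """Soft-binned ECE: sum over bins of |weighted hits - weighted confidence| / N."""
    hits = [0.0] * n_bins  # mass-weighted count of positive outcomes per bin
    conf = [0.0] * n_bins  # mass-weighted confidence per bin
    cache = {}
    for phrase, y in zip(phrases_used, labels):
        if phrase not in cache:
            cache[phrase] = beta_bin_masses(*PHRASES[phrase], n_bins=n_bins)
        masses, means = cache[phrase]
        for m in range(n_bins):
            hits[m] += masses[m] * y
            conf[m] += masses[m] * means[m]
    return sum(abs(h - c) for h, c in zip(hits, conf)) / len(labels)


# A speaker who says "Likely" and is right 8 times out of 10,
# versus one who says "Likely" and is never right.
good = soft_ece(["Likely"] * 10, [1] * 8 + [0] * 2)
bad = soft_ece(["Likely"] * 10, [0] * 10)
```

With a single score per phrase, each example would fall into exactly one bin; here the bin assignment is fractional, which is one plausible way a binned measure can be extended to distributions.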
Peer Reviews
Decision: ICLR 2025 Poster
- By considering confidence as a probabilistic simplex, the formulation of calibration as an instance of optimal transport is interesting.
- <s>The experiments are performed on a self-curated dataset whose details are not adequately described; more details on the dataset would be preferable. Additionally, there are public datasets that could be used for these experiments (e.g. https://nlp.jhu.edu/unli/, for natural language inference). As the authors propose a general calibration method, I believe that datasets with diverse settings should be considered.</s> - Some aspects of the experimental setup are unclear. See below. - Table 1: The proposed m
- Interesting proposal that uncertainty phrases could be associated with a distribution instead of a single confidence score. - Very clear presentation of the method and insightful comparison against existing work.
- The link to uncertainty phrases seems superficial, as they serve merely as names for distributions without strong justification for this treatment. - The motivation to represent uncertainty phrases as distributions is unclear; individual users often interpret these phrases in consistent orders, making scalar values (or simple binning) potentially sufficient. At a population level, it’s unclear why uncertainty phrases should align with complex empirical distributions rather than a straightforwa
- A good understanding and presentation of the existing literature on calibration. - Using confidence distributions makes the proposed estimator more robust to increasing the number of bins. - Proposes a novel framing of calibration as a composition of source confidence distributions, an optimal transport map to target confidence distributions, and an indexing function. - Provides potential real-world use cases, such as calibrating human expressions of uncertainty in medicine and LLM expressions
- The paper focuses on binary classification, which could make its impact limited. - Conversely, the relative complexity of the approach to non-statisticians could limit its adoption. - The candidate confidence distributions could in practice be very large, which could make some of the optimization problems very slow to solve.
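The framing of calibration as a transport map between phrase distributions, mentioned in the reviews above, can be illustrated with a small sketch. It assumes phrases are ordered by confidence and uses a monotone (north-west corner) coupling, which is an optimal transport plan for convex costs on an ordered line; this is a stand-in, since the paper's actual cost and target construction are not given in this summary. The histograms and the names `monotone_coupling` and `remap` are hypothetical.

```python
def monotone_coupling(source, target, eps=1e-12):
    """Monotone coupling between two histograms on ordered supports.

    Returns plan[i][j], the mass moved from source item i to target item j.
    """
    src, tgt = list(source), list(target)
    plan = [[0.0] * len(tgt) for _ in src]
    i = j = 0
    while i < len(src) and j < len(tgt):
        moved = min(src[i], tgt[j])
        plan[i][j] += moved
        src[i] -= moved
        tgt[j] -= moved
        if src[i] <= eps:   # source item exhausted, move to the next one
            i += 1
        if tgt[j] <= eps:   # target item filled, move to the next one
            j += 1
    return plan


def remap(plan, source_phrases, target_phrases):
    """Map each source phrase to the target phrase receiving most of its mass."""
    mapping = {}
    for phrase, row in zip(source_phrases, plan):
        mapping[phrase] = target_phrases[max(range(len(row)), key=row.__getitem__)]
    return mapping


# Hypothetical example: how often a speaker uses each phrase (source) versus
# how often each phrase would be warranted (target). Values are made up.
phrases = ["Unlikely", "Maybe", "Likely"]
plan = monotone_coupling([0.1, 0.2, 0.7], [0.3, 0.4, 0.3])
suggestion = remap(plan, phrases, phrases)
```

Reading off the largest entry in each row gives a per-phrase suggestion of the form "when you would say X, say Y instead", which is the kind of interpretable output the abstract describes, though the rule used here is only one simple choice.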
Taxonomy
Topics: AI-based Problem Solving and Planning · Bayesian Modeling and Causal Inference
