Calibrating Expressions of Certainty

Peiqi Wang; Barbara D. Lam; Yingcheng Liu; Ameneh Asgari-Targhi; Rameswar Panda; William M. Wells; Tina Kapur; Polina Golland

arXiv:2410.04315·cs.CL·April 3, 2025

Open Access·3 Reviews

TL;DR

This paper models linguistic expressions of certainty as distributions rather than single scores, generalizes miscalibration measures to that representation, and introduces a post-hoc method for calibrating how humans and models express confidence.

Contribution

The paper models certainty phrases as distributions over the simplex, generalizes existing miscalibration measures to this representation, introduces a new post-hoc calibration method, and uses these tools to give interpretable suggestions for improving calibration.

Findings

01 Improved calibration of human and model certainty expressions

02 Distributional representation captures semantics more accurately

03 Provides actionable suggestions for calibration enhancement

Abstract

We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.
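The distributional idea in the abstract can be pictured with a small sketch. This page does not reproduce the paper's definitions, so everything below is an assumption made for illustration: the phrase table `PHRASE_DIST` and its numbers are hypothetical, the distributions are discretized into five equal-width bins on [0, 1] (the binary case), and `soft_binned_ece` is a hypothetical name for an ECE-style score in which each example spreads its weight across bins according to its phrase's distribution. It is not the authors' estimator.

```python
# Hypothetical phrase -> distribution table. The numbers are illustrative only
# and are NOT taken from the paper. Each list gives probability mass over five
# equal-width confidence bins on [0, 1].
PHRASE_DIST = {
    "Unlikely": [0.6, 0.3, 0.1, 0.0, 0.0],
    "Maybe":    [0.1, 0.2, 0.4, 0.2, 0.1],
    "Likely":   [0.0, 0.0, 0.1, 0.3, 0.6],
}


def soft_binned_ece(phrases, outcomes, phrase_dist):
    """ECE-style score with soft bin assignment (an illustrative assumption).

    Each example contributes to every bin in proportion to the mass its
    certainty phrase places on that bin. The bin center stands in for the
    bin's confidence, which is a simplification.
    """
    n_bins = len(next(iter(phrase_dist.values())))
    centers = [(b + 0.5) / n_bins for b in range(n_bins)]
    weight = [0.0] * n_bins  # soft count of examples per bin
    hits = [0.0] * n_bins    # soft count of positive outcomes per bin

    for phrase, y in zip(phrases, outcomes):
        for b, mass in enumerate(phrase_dist[phrase]):
            weight[b] += mass
            hits[b] += mass * y

    total = sum(weight)
    ece = 0.0
    for b in range(n_bins):
        if weight[b] > 0:
            accuracy = hits[b] / weight[b]
            ece += (weight[b] / total) * abs(accuracy - centers[b])
    return ece


# Toy usage: four statements with binary outcomes (1 = the event occurred).
phrases = ["Likely", "Likely", "Maybe", "Unlikely"]
outcomes = [1, 1, 0, 0]
score = soft_binned_ece(phrases, outcomes, PHRASE_DIST)
```

If every phrase is given a point mass on a single bin, this reduces to an ordinary binned calibration error with one score per phrase, which is the prior setting the abstract contrasts against.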

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 5·Confidence 2

Strengths

- By considering confidence as a probabilistic simplex, the formulation of calibration as an instance of optimal transport is interesting.

Weaknesses

- <s>The experiments are performed over a self-curated dataset whose details are not aptly described. More details on the dataset would be preferable. Additionally, there are public datasets that could be used for these experiments (e.g. https://nlp.jhu.edu/unli/, for natural language inference). As the authors proposed a general calibration method, I believe that datasets with diverse settings should be considered.</s>
- Some experimental setups are unclear. See below.
- Table 1: The proposed m

Reviewer 02·Rating 6·Confidence 4

Strengths

- Interesting proposal that uncertainty phrases could be associated with a distribution instead of a single confidence score.
- Very clear presentation of the method and insightful comparison against existing work.

Weaknesses

- The link to uncertainty phrases seems superficial, as they serve merely as names for distributions without strong justification for this treatment.
- The motivation to represent uncertainty phrases as distributions is unclear; individual users often interpret these phrases in consistent orders, making scalar values (or simple binning) potentially sufficient. At a population level, it’s unclear why uncertainty phrases should align with complex empirical distributions rather than a straightforwa

Reviewer 03·Rating 8·Confidence 3

Strengths

- A good understanding and presentation of the existing literature on calibration.
- Using confidence distributions makes the proposed estimator more robust to increasing the number of bins.
- Proposes a novel framing of calibration as a composition of source confidence distributions, an optimal transport map to target confidence distributions, and an indexing function.
- Provides potential real-world use cases, such as calibrating human expressions of uncertainty in medicine and LLM expressions

Weaknesses

- The paper focuses on binary classification, which could make its impact limited.
- Conversely, the relative complexity of the approach to non-statisticians could limit its adoption.
- The candidate confidence distributions could in practice be very large, which could make some of the optimization problems very slow to solve.

Taxonomy

Topics: AI-based Problem Solving and Planning · Bayesian Modeling and Causal Inference