Language Model Probabilities are Not Calibrated in Numeric Contexts

Charles Lovering; Michael Krumdick; Viet Dac Lai; Seth Ebner; Nilesh Kumar; Varshini Reddy; Rik Koncel-Kedziorski; Chris Tanner

arXiv:2410.16007·cs.AI·March 6, 2025

TL;DR

This paper investigates whether language models accurately reflect the probabilities of numeric and categorical options in context, finding they are poorly calibrated and exhibit systematic biases influenced by artifacts like word order and frequency.

Contribution

The paper provides the first systematic analysis of language model calibration in numeric contexts, revealing significant biases and calibration issues even in simple settings.

Findings

01. Language models are poorly calibrated in numeric contexts.

02. Systematic biases are influenced by word order, identity, and frequency.

03. Models do not assign probabilities to options in proportion to the likelihoods implied by the context.

Abstract

Some statements have one well-defined continuation (e.g., "the Eiffel Tower is in [Paris]"), whereas others have a natural distribution over multiple options (e.g., "the weighted coin flip was [Heads/Tails].") We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice) an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and…
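To make the dice example concrete, the sketch below computes the exact probability of rolling a pair with two fair dice and compares it with a set of LM output probabilities. The LM numbers are hypothetical placeholders rather than results from the paper, and total variation distance is an assumed stand-in for whatever calibration metric the authors use. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import product


def pair_probability(sides: int = 6) -> Fraction:
    """Exact probability that two fair dice show the same face."""
    rolls = list(product(range(1, sides + 1), repeat=2))
    pairs = sum(1 for a, b in rolls if a == b)
    return Fraction(pairs, len(rolls))


def normalize(probs: dict[str, float]) -> dict[str, float]:
    """Renormalize probability mass over the candidate options only."""
    total = sum(probs.values())
    return {option: p / total for option, p in probs.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


# Reference distribution implied by the context (two fair six-sided dice).
p_pair = pair_probability()
reference = {"pair": float(p_pair), "no pair": float(1 - p_pair)}

# Hypothetical LM probabilities for the two continuations (assumed values,
# not taken from the paper); the remaining mass went to other tokens.
lm_raw = {"pair": 0.30, "no pair": 0.45}
lm = normalize(lm_raw)

gap = total_variation(reference, lm)
print(f"reference P(pair) = {reference['pair']:.3f}")
print(f"LM P(pair)        = {lm['pair']:.3f}")
print(f"calibration gap   = {gap:.3f}")
```

A perfectly calibrated model in this sense would give a gap of 0; the renormalization step restricts the comparison to the options named in the context.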

Taxonomy

Topics: Natural Language Processing Techniques

Methods: FLIP