BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton

TL;DR
The paper introduces BAS, a decision-theoretic metric for evaluating LLM confidence in abstention scenarios, revealing significant overconfidence issues not captured by standard metrics.
Contribution
It proposes BAS, a novel utility-based metric for assessing LLM confidence, and provides a benchmark showing variability and overconfidence in current models.
Findings
Larger models tend to have higher BAS scores.
Standard metrics like ECE and AURC can be misleading about confidence reliability.
Simple calibration methods can improve the reliability of LLM confidence.
Abstract
Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as…
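As a rough illustration of the answer-or-abstain idea described in the abstract, the sketch below scores confidence values under one simple payoff scheme and averages the realized utility over a grid of risk thresholds. The payoff values (+1 for a correct answer, a penalty of t / (1 - t) for a wrong one, 0 for abstaining), the threshold grid, and the function names are assumptions chosen for illustration; they are not taken from the paper, and the result should not be read as the paper's BAS definition. Under this assumed payoff, answering has non-negative expected utility exactly when the true probability of being correct is at least t, which is why thresholding a truthful confidence at t is the utility-maximizing rule.

```python
import numpy as np


def answer_or_abstain_utility(confidence, correct, t):
    """Realized per-example utility at risk threshold t (assumed payoff scheme).

    The model answers when confidence >= t and abstains otherwise.
    Assumed payoffs: +1 for a correct answer, -t / (1 - t) for a wrong
    answer, and 0 for abstaining.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    answered = confidence >= t
    payoff_if_answered = np.where(correct, 1.0, -t / (1.0 - t))
    return np.where(answered, payoff_if_answered, 0.0)


def mean_utility_over_thresholds(confidence, correct, num_thresholds=99):
    """Average realized utility over an evenly spaced grid of thresholds.

    The grid (0.01 to 0.99) is a hypothetical stand-in for a continuum of
    risk thresholds; it avoids t = 1, where the penalty is unbounded.
    """
    thresholds = np.linspace(0.01, 0.99, num_thresholds)
    per_threshold = [
        answer_or_abstain_utility(confidence, correct, t).mean()
        for t in thresholds
    ]
    return float(np.mean(per_threshold))


# Toy data: two correct and two incorrect answers.
correct = [True, True, False, False]
well_placed = [0.9, 0.9, 0.1, 0.1]    # low confidence on the wrong answers
overconfident = [0.9, 0.9, 0.9, 0.9]  # high confidence on everything

score_well_placed = mean_utility_over_thresholds(well_placed, correct)
score_overconfident = mean_utility_over_thresholds(overconfident, correct)
```

In this toy setup the overconfident assignment scores lower, because its wrong answers are still being given at strict thresholds where the assumed penalty is large. Both assignments have the same accuracy when every question is answered, so the difference comes entirely from how confidence guides abstention.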
