On Calibration of Large Language Models: From Response To Capability
Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun

TL;DR
This paper introduces capability calibration for large language models: estimating how likely a model is to solve a query overall, rather than whether a single generated response is correct. The authors argue that this target is better aligned with how confidence estimates are used in practice.
Contribution
It formally distinguishes capability calibration from response calibration, shows that the two differ both theoretically and empirically, and reports benefits for pass@$k$ prediction and inference budget allocation.
Findings
Capability calibration better predicts overall query success.
It improves pass@$k$ prediction accuracy.
It enhances inference budget allocation strategies.
Abstract
Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation…
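The distinction drawn in the abstract can be illustrated with a minimal sketch. It is not the paper's method or evaluation setup. It assumes that a query's capability target is approximated by the fraction of correct responses among several sampled generations, and that samples are independent when extrapolating to pass@$k$. All function names are hypothetical.

```python
from math import comb


def expected_accuracy(outcomes):
    """Empirical expected accuracy on one query.

    `outcomes` holds 0/1 correctness labels for several sampled responses.
    A response-level target is any single entry; the capability-level target
    sketched here is their mean (an assumption made for illustration).
    """
    if not outcomes:
        raise ValueError("need at least one sampled response")
    return sum(outcomes) / len(outcomes)


def pass_at_k_from_accuracy(p, k):
    """Probability that at least one of k samples is correct.

    Assumes each sample is correct independently with probability p.
    """
    return 1.0 - (1.0 - p) ** k


def pass_at_k_from_samples(n, c, k):
    """Combinatorial pass@k estimate from n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def brier_score(confidences, targets):
    """Mean squared error between stated confidences and target values."""
    if len(confidences) != len(targets):
        raise ValueError("confidences and targets must have the same length")
    return sum((q - t) ** 2 for q, t in zip(confidences, targets)) / len(targets)


# One query, four sampled responses, two of them correct.
outcomes = [1, 0, 0, 1]
capability_target = expected_accuracy(outcomes)      # mean over samples
response_target = outcomes[0]                        # a single draw
predicted_pass_at_2 = pass_at_k_from_accuracy(capability_target, 2)
```

In this sketch, a confidence of 0.5 matches the capability target exactly, yet it is always 0.5 away from any single response's 0/1 label, which is one way to see why the two calibration targets can disagree under stochastic decoding.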
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Healthcare and Education
