On Calibration of Large Language Models: From Response To Capability

Sin-Han Yang; Cheng-Kuang Wu; Chieh-Yen Lin; Yun-Nung Chen; Hung-yi Lee; Shao-Hua Sun

arXiv:2602.13540 · cs.CL · February 17, 2026

TL;DR

This paper introduces capability calibration for large language models: estimating how likely a model is to solve a query overall (its expected accuracy on that query) rather than whether a single generated response is correct. The authors argue that this target is better aligned with many practical settings.

Contribution

The paper formally distinguishes capability calibration from response calibration, shows that the two differ both theoretically and empirically, and reports benefits for pass@$k$ prediction and inference budget allocation.

Findings

01

Capability calibration better predicts overall query success.

02

It improves pass@$k$ prediction accuracy.

03

It improves inference budget allocation.
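To make the link between a capability estimate, pass@$k$, and budget allocation concrete, here is a minimal sketch. It is not the paper's method: it assumes that the k samples for a query are independent and share the same per-sample success probability `p`, in which case pass@$k$ is `1 - (1 - p)**k`. The function names `pass_at_k` and `samples_needed`, and the `max_k` cap, are hypothetical and introduced only for illustration.

```python
from typing import Optional


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k samples is correct.

    Assumption (not from the paper): samples are independent and each
    is correct with the same probability p.
    """
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float, max_k: int = 64) -> Optional[int]:
    """Smallest k with pass@k >= target, or None if max_k is not enough.

    A toy budget-allocation rule: spend more samples on queries where
    the estimated capability p is low, and give up past max_k.
    """
    for k in range(1, max_k + 1):
        if pass_at_k(p, k) >= target:
            return k
    return None


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1, 0.0):
        print(p, samples_needed(p, target=0.9))
```

Under these assumptions, a well-calibrated capability estimate is all that is needed to predict pass@$k$ and to decide how many samples a query deserves; a miscalibrated estimate would over- or under-spend the budget.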

Abstract

Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation…
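The distinction the abstract draws can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's evaluation setup: stochastic decoding is modeled as a coin flip with a fixed, known success probability `p_solve` per query, the capability target is approximated by the mean correctness over many samples, and the squared error is used as a stand-in scoring rule. All names (`sample_correctness`, `capability_target`, `squared_error`) are hypothetical.

```python
import random
from typing import List


def sample_correctness(p_solve: float, n: int, rng: random.Random) -> List[int]:
    """Simulate n stochastic decodes of one query; 1 means correct."""
    return [1 if rng.random() < p_solve else 0 for _ in range(n)]


def capability_target(outcomes: List[int]) -> float:
    """Monte Carlo estimate of the model's expected accuracy on the query."""
    return sum(outcomes) / len(outcomes)


def squared_error(confidence: float, target: float) -> float:
    """Squared gap between a stated confidence and a target value."""
    return (confidence - target) ** 2


rng = random.Random(0)
outcomes = sample_correctness(p_solve=0.3, n=1000, rng=rng)

# Response-level target: correctness of one particular response (0 or 1).
response_target = outcomes[0]
# Capability-level target: expected accuracy across responses.
cap_target = capability_target(outcomes)

confidence = 0.3  # a confidence that matches the simulated capability
print("error vs. single response:", squared_error(confidence, response_target))
print("error vs. capability:     ", squared_error(confidence, cap_target))
```

In this toy setting, a confidence equal to the query's true solve rate scores well against the capability target but poorly against any single response, because one sampled response is either right or wrong regardless of the underlying solve rate.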

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Healthcare and Education