Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs
Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson

TL;DR
This paper demonstrates that large language models can meaningfully assess their confidence in the semantic correctness of their responses, with calibration emerging as a natural byproduct of their training, and explores factors affecting this calibration.
Contribution
It provides a theoretical explanation for the emergence of semantic calibration in LLMs and validates this with empirical experiments across different training regimes.
Findings
Base LLMs are semantically calibrated in open-domain QA.
RL instruction-tuning reduces semantic calibration.
Chain-of-thought reasoning disrupts calibration.
Abstract
Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or…
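The sampling-based notion of semantic confidence described in the abstract can be illustrated with a minimal sketch. It assumes that the collapsing function B is a plain string normalization and that calibration is scored with a standard binned expected calibration error. Neither choice is confirmed by this summary, and the names `collapse`, `semantic_confidence`, and `expected_calibration_error` are hypothetical, introduced only for illustration.

```python
from collections import Counter
import string


def collapse(answer: str) -> str:
    """Hypothetical collapsing function B: map a raw answer to a semantic class.

    Assumption: lowercasing and stripping punctuation stands in for whatever
    semantic equivalence the paper actually uses.
    """
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def semantic_confidence(samples, collapse_fn=collapse):
    """Return (most frequent semantic class, fraction of samples in that class)."""
    counts = Counter(collapse_fn(s) for s in samples)
    top_class, top_count = counts.most_common(1)[0]
    return top_class, top_count / len(samples)


def expected_calibration_error(confidences, corrects, n_bins=10):
    """Binned ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Four sampled answers to one question; three collapse to the same class.
answer_class, confidence = semantic_confidence(["Paris", "paris.", "Paris", "Lyon"])
```

Here `confidence` is the share of sampled answers that fall into the majority class. Under this sketch, a model is semantically calibrated when, across many questions, answers given with confidence near p are correct about a fraction p of the time, which is what the binned error measures.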
Peer Reviews
Decision: ICLR 2026 Poster
The paper is a pleasure to read and provides a nice interplay of theory guided empirical experiments to show the existence of semantic calibration in base models. As I am not a theory person and don't have a background in multi-class calibration and its relation with loss functions, I am unable to assess the correctness of the main theorems. However, I was able to follow the intuition which broadly makes sense to me and connects well with existing literature. The main conceptual takeaway from t
I don't have any issues with material covered in the paper which is thoroughly presented. The authors have presented a solid work worthy of acceptance. The only drawback I see is the limited applicability of this theory and the exclusive focus on their B-confidence calibration which the authors acknowledge in Section B.1. While this work greatly extends our understanding of calibration on a semantic level for base LLMs, the semantic level is still limited to at most a sentence-level (as longer
1. Parameterizing calibration by an arbitrary collapsing function B is a neat, flexible formalism that connects sampling-based semantic confidence to established calibration literature. This lets the authors transparently say which semantic granularity they evaluate. 2. There is a solid theoretical contribution linking diverse prior work. The equivalence (Thm.6) between B-calibration and local-loss-optimality is a meaningful bridge from optimization theory to semantic calibration; Thm.9 provides
1. The chosen datasets primarily focus on an in-distribution assumption, which may limit the generality of the proposed method. 2. The claim "instruction-tuning breaks calibration" may require deeper analysis.
It introduces the notion of B-calibration—a general framework for defining calibration over arbitrary equivalence classes of outputs—which unifies token-level and semantic-level calibration under a single formalism. This perspective reframes an underexplored question (“can base LLMs meaningfully assess confidence in their answers’ meanings?”) into a rigorous, testable problem, providing a clear theoretical bridge between semantic uncertainty and local loss optimality. The work’s quality is hig
Although the authors present calibration as emerging from the ability to “predict one’s own semantic output distribution,” this remains correlational. The LoRA probe experiment (Claim 10) demonstrates correlation between learnability and calibration but not causation. The paper claims novelty in unifying calibration and loss-optimality, but related work in multi-calibration and conformal prediction is discussed mainly in the appendix. Bringing these connections into the main text—perhaps as a d
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
