A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Zhanliang Wang, Jiancong Xiao, Ruochen Jin, Shu Yang, Bojian Hou, Li Shen

TL;DR
This paper introduces Sem-ECE, a framework for evaluating calibration in open-ended question answering. It samples multiple answers, groups them semantically, and assesses confidence, avoiding the restricted output formats, self-reported confidence, and task-specific extraction rules that limit existing methods.
Contribution
The paper proposes Sem-ECE, a semantic-sampling calibration evaluation method for open-ended QA, with estimators shown to be asymptotically unbiased and reported improvements over verbalized confidence and existing sampling-based methods.
Findings
Sem-ECE outperforms verbalized confidence and existing sampling methods.
Sem$_2$-ECE achieves smaller calibration error on hard questions.
Theoretical analysis confirms that the estimators are asymptotically unbiased.
Abstract
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples…
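The abstract is cut off before it describes the estimator, so the following is only a minimal sketch of the general idea (sample answers, group them semantically, read confidence off the groups, then compare confidence with accuracy), not the paper's actual Sem-ECE or Sem$_2$-ECE definition. Everything in it is an assumption made for illustration: `normalize` is a hypothetical stand-in for semantic grouping (a real pipeline would likely use an entailment model or an LLM judge), `semantic_confidence` takes the relative size of the largest group as confidence, and `expected_calibration_error` is the standard equal-width binned ECE.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic grouping: answers that match after
    # lowercasing and dropping punctuation are treated as equivalent.
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def semantic_confidence(samples):
    # Assumption: confidence is the fraction of sampled answers that fall in
    # the largest semantic group; that group's key is the predicted answer.
    groups = Counter(normalize(s) for s in samples)
    answer, count = groups.most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    # Standard binned ECE: weighted average of |accuracy - mean confidence|
    # over equal-width confidence bins.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Toy usage with made-up samples for two questions (not data from the paper).
sampled = [
    ["Paris", "paris.", "Lyon", "Paris "],
    ["1912", "1913", "1912", "1912"],
]
gold = ["paris", "1913"]

predictions = [semantic_confidence(s) for s in sampled]
confidences = [conf for _, conf in predictions]
correct = [int(ans == g) for (ans, _), g in zip(predictions, gold)]
print(expected_calibration_error(confidences, correct))
```

Exact-match grouping is used here only to keep the sketch self-contained; it would split paraphrases such as "the capital is Paris" and "Paris" into separate groups, which is the kind of failure semantic grouping is meant to avoid.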
