Full-ECE: A Metric For Token-level Calibration on Large Language Models
Han Liu, Yupeng Zhang, Bingning Wang, Weipeng Chen, Xiaolin Hu

TL;DR
This paper introduces Full-ECE, a calibration metric designed for large language models that evaluates the entire predicted probability distribution over the vocabulary, giving a more accurate picture of how well the model's uncertainty estimates are calibrated.
Contribution
The paper proposes the concept of full calibration and develops the Full-ECE metric, addressing limitations of traditional calibration metrics for LLMs.
Findings
Full-ECE provides a more accurate and robust calibration assessment for LLMs than ECE or cw-ECE.
Traditional ECE metrics are inadequate for models with large vocabularies.
Full-ECE captures the entire predicted probability distribution.
Abstract
Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
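The abstract does not state the Full-ECE formula, so the following is only a minimal sketch of the contrast it describes, not the authors' implementation. It assumes equal-width confidence bins and assumes that "evaluating the entire predicted probability distribution" means binning every (position, token) probability rather than just the top-1 confidence used by standard ECE. The names `ece`, `full_ece`, and `_binned_gap` are hypothetical.

```python
from typing import List, Sequence


def _binned_gap(confidences: Sequence[float], hits: Sequence[int], n_bins: int) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    # Per bin: [count, sum of confidences, sum of hits].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, h in zip(confidences, hits):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += p
        bins[idx][2] += h
    total = len(confidences)
    # (n_m / N) * |hits_m / n_m - conf_m / n_m| simplifies to |hits_m - conf_m| / N.
    return sum(abs(s_hit - s_conf) for count, s_conf, s_hit in bins if count) / total


def ece(probs: List[List[float]], labels: List[int], n_bins: int = 10) -> float:
    """Standard ECE: only the top-1 probability at each position is scored."""
    confidences, hits = [], []
    for row, label in zip(probs, labels):
        top = max(range(len(row)), key=row.__getitem__)
        confidences.append(row[top])
        hits.append(1 if top == label else 0)
    return _binned_gap(confidences, hits, n_bins)


def full_ece(probs: List[List[float]], labels: List[int], n_bins: int = 10) -> float:
    """Assumed full-distribution variant: every token probability is scored."""
    confidences, hits = [], []
    for row, label in zip(probs, labels):
        for token_id, p in enumerate(row):
            confidences.append(p)
            hits.append(1 if token_id == label else 0)
    return _binned_gap(confidences, hits, n_bins)


if __name__ == "__main__":
    # Two positions over a toy 3-token vocabulary; the reference token is 0 both times.
    toy_probs = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
    toy_labels = [0, 0]
    print("ECE:", ece(toy_probs, toy_labels))
    print("Full-ECE (sketch):", full_ece(toy_probs, toy_labels))
```

In this sketch the two functions differ only in which probabilities enter the bins. The full variant also scores the low-probability tail of the vocabulary, which top-1 ECE never sees. For a real LLM vocabulary the nested loop would be replaced by vectorized array operations, since it visits every token at every position.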
Taxonomy
Topics: Topic Modeling
