Full-ECE: A Metric For Token-level Calibration on Large Language Models

Han Liu; Yupeng Zhang; Bingning Wang; Weipeng Chen; Xiaolin Hu

arXiv:2406.11345 · cs.CL · June 18, 2024

Open Access

TL;DR

This paper introduces Full-ECE, a calibration metric for large language models that evaluates the entire predicted probability distribution, giving a more accurate and robust measure of how well the model's token probabilities are calibrated.

Contribution

The paper proposes the concept of full calibration and develops the Full-ECE metric, addressing limitations of traditional calibration metrics for LLMs.

Findings

01 Full-ECE gives a more accurate and robust calibration assessment for LLMs than ECE or cw-ECE.

02 Traditional ECE metrics are inadequate for models with large vocabularies.

03 Full-ECE captures the entire predicted probability distribution.

Abstract

Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
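
The abstract contrasts ECE with a metric that scores the whole distribution but does not state the formula. The sketch below is one plausible reading, not the authors' definition: it assumes Full-ECE bins every (position, token) probability and compares each bin's mean probability with the fraction of those tokens that were the true next token. The names `top1_ece`, `full_ece`, and `_binned_gap` and the choice of 10 equal-width bins are illustrative assumptions.

```python
from typing import Iterable, Sequence, Tuple


def _binned_gap(pairs: Iterable[Tuple[float, bool]], n_bins: int) -> float:
    """Weighted mean |accuracy - confidence| over equal-width probability bins."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    total = 0
    for prob, hit in pairs:
        b = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes to the last bin
        conf_sum[b] += prob
        hit_sum[b] += 1.0 if hit else 0.0
        total += 1
    # sum_b (n_b / total) * |acc_b - conf_b| simplifies to the expression below
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / total


def top1_ece(probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10) -> float:
    """Standard ECE: only the highest-probability token at each position is binned."""
    def pairs():
        for p, y in zip(probs, labels):
            k = max(range(len(p)), key=p.__getitem__)
            yield p[k], k == y
    return _binned_gap(pairs(), n_bins)


def full_ece(probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10) -> float:
    """Assumed full-distribution variant: every token's probability is binned."""
    def pairs():
        for p, y in zip(probs, labels):
            for k, pk in enumerate(p):
                yield pk, k == y
    return _binned_gap(pairs(), n_bins)


# Toy example: 2 positions, vocabulary of 3 tokens.
probs = [[0.72, 0.23, 0.05], [0.55, 0.38, 0.07]]
labels = [0, 1]
print(round(top1_ece(probs, labels), 3))  # 0.415
print(round(full_ece(probs, labels), 3))  # 0.3
```

With a real LLM the inner loop runs over the whole vocabulary at every position, so low-probability tokens dominate the bin counts; under this reading, that is what lets the metric reflect calibration of the tail rather than only the top prediction.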

Taxonomy

Topics: Topic Modeling