Measuring Calibration in Deep Learning
Jeremy Nixon, Mike Dusenberry, Ghassen Jerfel, Timothy Nguyen, Jeremiah Liu, Linchuan Zhang, Dustin Tran

TL;DR
This paper thoroughly investigates calibration measures in deep learning, revealing how different choices impact evaluation outcomes and providing recommendations for more reliable calibration assessment.
Contribution
It offers a comprehensive empirical analysis of calibration measurement choices and proposes best practices, including adaptive binning and class conditioning, to improve calibration evaluation.
Findings
Calibration measure choices significantly affect method rankings.
Class conditioning improves calibration evaluation accuracy.
Adaptive binning enhances stability across different bin counts.
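To make the last two findings concrete, here is a minimal sketch of a calibration error that combines class conditioning (one error per class, averaged) with adaptive binning (bins holding roughly equal numbers of predictions rather than equal-width intervals). The function name `adaptive_classwise_error`, its parameters, and the choice to skip empty bins are illustrative assumptions, not the paper's reference implementation.

```python
def adaptive_classwise_error(probs, labels, num_bins=5):
    """Class-conditional calibration error with equal-mass (adaptive) bins.

    probs:  list of per-example probability vectors, one entry per class.
    labels: list of integer true labels.
    Hypothetical helper for illustration only.
    """
    num_classes = len(probs[0])
    total = 0.0
    for k in range(num_classes):
        # Pair the predicted probability of class k with whether k is the true label,
        # then sort by probability so consecutive slices form equal-mass bins.
        pairs = sorted((p[k], 1.0 if y == k else 0.0) for p, y in zip(probs, labels))
        n = len(pairs)
        class_err = 0.0
        for r in range(num_bins):
            chunk = pairs[r * n // num_bins:(r + 1) * n // num_bins]
            if not chunk:
                continue  # assumption: empty bins contribute zero error
            mean_conf = sum(c for c, _ in chunk) / len(chunk)
            frequency = sum(h for _, h in chunk) / len(chunk)
            class_err += abs(frequency - mean_conf)
        total += class_err / num_bins
    return total / num_classes


if __name__ == "__main__":
    probs = [[0.8, 0.2]] * 4
    labels = [0, 0, 0, 1]
    print(adaptive_classwise_error(probs, labels, num_bins=1))
```

Because the bin edges follow the sorted predictions, every bin holds about the same number of examples, which is one plausible reason such a measure would vary less with the bin count than a fixed-width scheme.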
Abstract
Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. How one measures calibration remains a challenge: expected calibration error, the most popular metric, has numerous flaws which we outline, and there is no clear empirical understanding of how its choices affect conclusions in practice, and what recommendations there are to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of…
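For reference, the expected calibration error mentioned in the abstract is commonly computed by grouping predictions into equal-width confidence bins and averaging the gap between accuracy and mean confidence, weighted by bin size. The sketch below follows that common formulation using only the maximum predicted probability; the function name and arguments are assumptions for illustration, not code from the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Top-label ECE with equal-width bins over [0, 1].

    confidences: max predicted probability per example.
    correct:     1 if the top prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of the data.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

Several of the choices the abstract lists show up here as fixed decisions: only the maximum probability is scored, the bins are equal-width, `num_bins` is a free parameter, and the gap uses an absolute (L1) difference.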
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
