Measuring Calibration in Deep Learning

Jeremy Nixon; Mike Dusenberry; Ghassen Jerfel; Timothy Nguyen; Jeremiah Liu; Linchuan Zhang; Dustin Tran

arXiv:1904.01685 · cs.LG · August 11, 2020 · 156 cites

TL;DR

This paper thoroughly investigates calibration measures in deep learning, revealing how different choices impact evaluation outcomes and providing recommendations for more reliable calibration assessment.

Contribution

It offers a comprehensive empirical analysis of calibration measurement choices and proposes best practices, including adaptive binning and class conditioning, to improve calibration evaluation.

Findings

01. Calibration measure choices significantly affect method rankings.

02. Class conditioning improves calibration evaluation accuracy.

03. Adaptive binning enhances stability across different bin counts.
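
The difference between fixed-width and adaptive bins can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names (`equal_width_edges`, `equal_mass_edges`, `assign_bins`, `binned_calibration_error`) are hypothetical, and "adaptive" is assumed here to mean quantile-based edges, so that each bin holds roughly the same number of predictions.

```python
import numpy as np


def equal_width_edges(n_bins):
    """Fixed bin edges that split [0, 1] into intervals of equal width."""
    return np.linspace(0.0, 1.0, n_bins + 1)


def equal_mass_edges(confidences, n_bins):
    """Adaptive bin edges (assumption: quantiles of the observed confidences)."""
    return np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))


def assign_bins(confidences, edges):
    """Map each confidence to a bin index in 0 .. len(edges) - 2."""
    # Only the interior edges are needed; values equal to an edge go to the lower bin.
    return np.digitize(np.asarray(confidences, dtype=float), edges[1:-1], right=True)


def binned_calibration_error(confidences, correct, edges):
    """Weighted mean of |accuracy - mean confidence| over the non-empty bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    idx = assign_bins(confidences, edges)
    error = 0.0
    for b in range(len(edges) - 1):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            error += mask.mean() * gap  # mask.mean() is the fraction of points in bin b
    return float(error)


if __name__ == "__main__":
    # Synthetic, overconfident predictions clustered near 1.0 (illustrative data only).
    rng = np.random.default_rng(0)
    conf = 1.0 - 0.3 * rng.random(1000) ** 3
    correct = rng.random(1000) < conf - 0.05
    for n_bins in (5, 15, 50):
        fixed = binned_calibration_error(conf, correct, equal_width_edges(n_bins))
        adaptive = binned_calibration_error(conf, correct, equal_mass_edges(conf, n_bins))
        print(n_bins, round(fixed, 4), round(adaptive, 4))
```

When confidences cluster near 1.0, equal-width bins tend to leave most of the low-confidence bins empty or nearly empty, whereas quantile edges spread the points evenly; running the loop over several bin counts is one way to probe the stability claim on your own model outputs.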

Abstract

Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. How one measures calibration remains a challenge: expected calibration error, the most popular metric, has numerous flaws which we outline, and there is no clear empirical understanding of how its choices affect conclusions in practice, and what recommendations there are to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of…
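
Several of the measurement choices listed in the abstract can be exposed as parameters of a single function. The sketch below is an assumption-laden illustration rather than the paper's code: `calibration_error` and `_binned_gap` are hypothetical names, bins are equal-width, thresholding is taken to mean discarding predictions whose probability falls below `threshold`, and the class-conditional variant simply averages a per-class binned error over classes.

```python
import numpy as np


def _binned_gap(conf, hit, n_bins, norm):
    """Aggregate per-bin gaps between empirical frequency and mean confidence."""
    if conf.size == 0:
        return 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1], right=True)  # bin index in 0 .. n_bins - 1
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(hit[mask].mean() - conf[mask].mean())
            # Weight each bin by its share of the retained predictions.
            total += mask.mean() * (gap if norm == "l1" else gap ** 2)
    return float(total if norm == "l1" else np.sqrt(total))


def calibration_error(probs, labels, n_bins=15, norm="l1",
                      class_conditional=False, threshold=0.0):
    """Binned calibration error for an (n_examples, n_classes) probability array.

    class_conditional=False scores only the maximum predicted probability
    (the usual expected-calibration-error setup); True scores every class
    probability, binned separately per class and then averaged.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, k = probs.shape
    if not class_conditional:
        pred = probs.argmax(axis=1)
        conf = probs[np.arange(n), pred]
        hit = (pred == labels).astype(float)
        keep = conf >= threshold
        return _binned_gap(conf[keep], hit[keep], n_bins, norm)
    per_class = []
    for c in range(k):
        conf = probs[:, c]
        hit = (labels == c).astype(float)
        keep = conf >= threshold  # assumption: thresholding drops small probabilities
        per_class.append(_binned_gap(conf[keep], hit[keep], n_bins, norm))
    return float(np.mean(per_class))
```

Calling the function with different `n_bins`, `norm`, `class_conditional`, and `threshold` values on the same predictions gives a quick way to see how much a reported calibration number depends on these settings.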

Code & Models

Repositories

ENSTA-U2IS-AI/torch-uncertainty (PyTorch)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning