Mitigating Bias in Calibration Error Estimation
Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, Michael C. Mozer

TL;DR
This paper investigates statistical bias in the standard binned estimate of calibration error, ECE_bin, and identifies estimators with lower bias, giving a more trustworthy assessment of whether a model's confidence matches its accuracy.
Contribution
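It introduces a framework for computing the bias of a calibration error estimator on an evaluation data set of a given size, built on synthesized model outputs that match the statistics of common neural architectures on popular data sets. Using this framework, it identifies better estimators, including the proposed ECE_sweep.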
Findings
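Binning-based estimators with equal-mass bins (the same number of instances per bin) have lower bias than estimators with equal-width bins.
The proposed ECE_sweep estimator improves the detection of model miscalibration.
The debiased estimator and ECE_sweep outperform the traditional ECE_bin estimator.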
Abstract
For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared. Most research in calibration focuses on techniques to reduce this empirical measure of calibration error, ECE_bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results indicate two…
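The binning procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name ece_bin, its parameters, and the default of 15 bins are assumptions made for this example, and the per-bin gap uses an absolute difference, which may not match the norm used in the paper. The equal_mass flag switches between equal-width bins on [0, 1] and bins holding (nearly) the same number of instances.

```python
import numpy as np


def ece_bin(confidences, correct, num_bins=15, equal_mass=False):
    """Binned calibration error estimate (hypothetical helper for illustration).

    confidences: predicted confidence per example, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    Returns the weighted average over bins of |mean confidence - accuracy|.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size

    if equal_mass:
        # Sort by confidence and cut into groups of (nearly) equal size.
        order = np.argsort(confidences, kind="stable")
        groups = np.array_split(order, num_bins)
    else:
        # Equal-width bins on [0, 1]; digitize against the interior edges
        # so that bin indices run from 0 to num_bins - 1.
        edges = np.linspace(0.0, 1.0, num_bins + 1)
        idx = np.digitize(confidences, edges[1:-1], right=True)
        groups = [np.flatnonzero(idx == b) for b in range(num_bins)]

    total = 0.0
    for g in groups:
        if g.size == 0:
            continue  # empty bins contribute nothing
        gap = abs(confidences[g].mean() - correct[g].mean())
        total += (g.size / n) * gap
    return total


if __name__ == "__main__":
    # Synthetic, perfectly calibrated outputs: each label is drawn with
    # success probability equal to the stated confidence, so the true
    # calibration error is zero by construction.
    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=200)
    hit = rng.random(200) < conf
    print("equal-width:", ece_bin(conf, hit))
    print("equal-mass: ", ece_bin(conf, hit, equal_mass=True))
```

Because the synthetic data above are calibrated by construction, any nonzero value printed reflects estimation error on a finite sample rather than real miscalibration, which is the kind of effect the paper's bias framework is designed to measure.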
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
