Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration
Luciana Ferrer, Daniel Ramos

TL;DR
This paper advocates for using proper scoring rules (PSRs) over calibration metrics like ECE to evaluate the quality of posterior probabilities in machine learning, emphasizing the importance of discrimination performance.
Contribution
The paper provides a theoretical and empirical analysis demonstrating that expected PSRs are more appropriate than calibration metrics for assessing posterior quality, and introduces a practical calibration loss metric.
Findings
Expected PSRs are principled measures of posterior quality.
Calibration metrics like ECE are insufficient for performance evaluation.
Calibration loss outperforms ECE and score divergence as a diagnostic tool.
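The summary above does not give a formula for calibration loss. One plausible reading, assumed here rather than taken from the source, is the drop in an expected PSR obtained by passing the posteriors through a fitted calibration transform. The sketch below uses cross-entropy as the PSR and single-parameter temperature scaling, fitted by grid search, as the transform. The synthetic data, the helper names, and the choice of transform are all illustrative assumptions.

```python
import math

def cross_entropy(p1, labels):
    """Mean negative log posterior of the true class (binary case).

    p1[i] is the posterior assigned to class 1 for sample i.
    """
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for p, y in zip(p1, labels)) / len(labels)

def temperature_scale(p1, temperature):
    """Divide the log-odds by a temperature and map back to a probability."""
    scaled = []
    for p in p1:
        logit = math.log(p / (1.0 - p))
        scaled.append(1.0 / (1.0 + math.exp(-logit / temperature)))
    return scaled

def calibration_loss(p1, labels):
    """Assumed definition: CE(raw) - CE(after the best temperature).

    For brevity the transform is fitted on the same data it is scored on;
    a held-out or cross-validated fit would give a less optimistic estimate.
    """
    grid = [0.25 + 0.01 * k for k in range(476)]  # temperatures 0.25 .. 5.00
    raw = cross_entropy(p1, labels)
    best_t = min(grid, key=lambda t: cross_entropy(temperature_scale(p1, t), labels))
    calibrated = cross_entropy(temperature_scale(p1, best_t), labels)
    return raw - calibrated, best_t

# Synthetic overconfident system: always 95% confident, right 85% of the time.
n = 2000
labels = [i % 2 for i in range(n)]
p1 = []
for i, y in enumerate(labels):
    correct = (i % 20) < 17  # 17 of every 20 samples are classified correctly
    p_true = 0.95 if correct else 0.05  # posterior given to the true class
    p1.append(p_true if y == 1 else 1.0 - p_true)

loss, best_t = calibration_loss(p1, labels)
print(f"calibration loss = {loss:.4f} nats at temperature {best_t:.2f}")
```

Under this reading, the calibration loss is expressed in the same units as the PSR itself, so it can be compared directly against the overall cross-entropy to judge how much of the total cost is attributable to miscalibration.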
Abstract
Most machine learning classifiers are designed to output posterior probabilities for the classes given the input sample. These probabilities may be used to make the categorical decision on the class of the sample; provided as input to a downstream system; or provided to a human for interpretation. Evaluating the quality of the posteriors generated by these systems is an essential problem which was addressed decades ago with the invention of proper scoring rules (PSRs). Unfortunately, much of the recent machine learning literature uses calibration metrics -- most commonly, the expected calibration error (ECE) -- as a proxy to assess posterior performance. The problem with this approach is that calibration metrics reflect only one aspect of the quality of the posteriors, ignoring the discrimination performance. For this reason, we argue that calibration metrics should play no role in the…
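To make the abstract's central point concrete, here is a minimal sketch comparing two hypothetical binary classifiers on balanced synthetic data. System A always outputs the class prior and so has no discrimination at all. System B separates the classes well but is overconfident. The helper functions, the binned top-class ECE variant, and the data are illustrative assumptions, not the paper's experimental setup.

```python
import math

def cross_entropy(posteriors, labels):
    """Logarithmic PSR: mean negative log posterior of the true class."""
    return -sum(math.log(p[y]) for p, y in zip(posteriors, labels)) / len(labels)

def brier(posteriors, labels):
    """Brier PSR: mean squared distance between posterior and one-hot label."""
    total = 0.0
    for p, y in zip(posteriors, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)

def ece(posteriors, labels, n_bins=10):
    """Binned ECE computed on the confidence of the top-scoring class."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(posteriors, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    err = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(a for _, a in items) / len(items)
            err += len(items) / n * abs(avg_conf - accuracy)
    return err

n = 2000
labels = [i % 2 for i in range(n)]  # balanced two-class data

# System A: ignores the input and always outputs the prior (0.5, 0.5).
system_a = [[0.5, 0.5] for _ in labels]

# System B: gives 0.95 to one class; that class is the true one 85% of the time.
system_b = []
for i, y in enumerate(labels):
    top = y if (i % 20) < 17 else 1 - y
    system_b.append([0.95 if k == top else 0.05 for k in range(2)])

for name, post in (("A (prior only)", system_a), ("B (overconfident)", system_b)):
    print(f"{name}: ECE={ece(post, labels):.3f} "
          f"CE={cross_entropy(post, labels):.3f} Brier={brier(post, labels):.3f}")
```

In this construction ECE ranks the uninformative system A ahead of system B, while both proper scoring rules prefer B, because they account for discrimination as well as calibration.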
Taxonomy
Topics: Forecasting Techniques and Applications
