Measures of predictive accuracy, miscalibration and discrimination
Łukasz Delong, Mario Wüthrich

TL;DR
This paper analyzes evaluation metrics for point predictors, highlighting limitations of popular scores like ABC and Gini, and advocates for mean-consistent loss functions for honest model assessment.
Contribution
It introduces a new variant of Murphy's decomposition, relates its miscalibration and discrimination components to Lorenz-curve-based measures, and demonstrates the shortcomings of existing scores for model evaluation.
Findings
ABC and ABC$^2$ rely on predictor-dependent weights, so they do not align with the class of mean-consistent scoring functions.
The Gini score likewise fails to align with mean-consistent scoring functions.
Using mean-consistent loss functions improves the honesty of model comparisons.
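The gap between a rank-based score and a mean-consistent loss can be illustrated with a minimal Python sketch. It is not taken from the paper: the simulated data, the helper names `gini_score` and `mse`, and the particular rank-based Gini formula are assumptions chosen for illustration. The Gini-type score below depends on the predictor only through the ordering it induces, so doubling every prediction leaves it unchanged, while the squared error loss (a Bregman divergence) penalizes the resulting miscalibration.

```python
import numpy as np

def gini_score(y, pred):
    """Rank-based Gini-type score (illustrative formula, an assumption):
    responses are ordered by the predictor and weighted by centered ranks."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(pred, kind="stable")  # only the ordering of pred is used
    y_sorted = y[order]
    n = len(y_sorted)
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * y_sorted) / (n * y_sorted.sum())

def mse(y, pred):
    """Mean squared error, a mean-consistent loss (Bregman divergence)."""
    return float(np.mean((np.asarray(y, dtype=float) - pred) ** 2))

# Hypothetical simulated portfolio: true means mu, Poisson responses y.
rng = np.random.default_rng(0)
mu = rng.gamma(shape=2.0, scale=1.0, size=1000)
y = rng.poisson(mu)

pred_good = mu        # mean-calibrated predictor
pred_bad = 2.0 * mu   # same ranking, but every prediction is doubled

print("Gini:", gini_score(y, pred_good), gini_score(y, pred_bad))
print("MSE: ", mse(y, pred_good), mse(y, pred_bad))
```

Both predictors receive the same Gini-type score because they rank the observations identically, whereas the squared error separates them. This only illustrates rank invariance; the paper's own argument concerns predictor-dependent weights and is not reproduced here.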
Abstract
We study the evaluation of real-valued point predictors under the decision-theoretic framework of mean-consistent loss functions given by the Bregman divergences. We first derive a new version of Murphy's decomposition of the expected loss which does not directly include the response itself but only its predictors. We then relate the miscalibration and the discrimination components of Murphy's decomposition to Lorenz-curve-based accuracy measures that are widely used in practice. Besides the usual area between the concentration and Lorenz curves, ABC, we introduce a mean-squared version ABC$^2$ that mitigates some of the weaknesses of the original ABC in identifying mean-calibration. More importantly, both ABC and ABC$^2$ are shown to rely on predictor-dependent weights, so they fail to align with the class of mean-consistent scoring functions. In the same spirit, we derive a similar…
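As a rough illustration of the decomposition the abstract refers to, the following Python sketch computes the standard empirical Murphy decomposition, score = uncertainty - discrimination + miscalibration, for the squared error loss. It is a sketch under stated assumptions, not the paper's derivation: the function name `murphy_decomposition_squared`, the simulated data, and the recalibration by group means over distinct predictor values are all hypothetical choices. For the squared error loss the miscalibration and discrimination terms can be written using only the predictor, its recalibrated version, and the overall mean; whether this coincides with the paper's new version is an assumption.

```python
import numpy as np

def murphy_decomposition_squared(y, pred):
    """Empirical Murphy decomposition for squared error loss.
    Assumes pred takes few distinct values, so group means can
    serve as the recalibrated predictor."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)

    # Recalibrate: mean response within each group of equal predictions.
    _, inverse = np.unique(pred, return_inverse=True)
    group_means = np.bincount(inverse, weights=y) / np.bincount(inverse)
    pred_rc = group_means[inverse]
    y_bar = y.mean()

    return {
        "score": float(np.mean((y - pred) ** 2)),      # expected loss of pred
        "unc": float(np.mean((y - y_bar) ** 2)),       # uncertainty
        "dsc": float(np.mean((pred_rc - y_bar) ** 2)), # discrimination
        "msc": float(np.mean((pred_rc - pred) ** 2)),  # miscalibration
    }

# Hypothetical example: five risk groups, predictor inflated by 50%.
rng = np.random.default_rng(1)
group = rng.integers(0, 5, size=2000)
mu = group + 1.0
y = rng.poisson(mu)
pred = 1.5 * mu  # correct ranking, wrong level

parts = murphy_decomposition_squared(y, pred)
print(parts)
print("unc - dsc + msc =", parts["unc"] - parts["dsc"] + parts["msc"])
```

In this in-sample setting the identity score = unc - dsc + msc holds up to floating-point error, and the miscalibration term is strictly positive because the predictor is inflated relative to the group means.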
