Empirical AUC for evaluating probabilistic forecasts
Simon Byrne

TL;DR
This paper examines the empirical AUC as a scoring function for probabilistic forecasts, shows that it is not generally a proper scoring function, and identifies restrictions and modifications under which it becomes proper.
Contribution
It analyzes whether the empirical AUC is a proper scoring function for probabilities of binary outcomes and describes restrictions and modifications under which it becomes one.
Findings
AUC is not generally a proper scoring function
With some restrictions, or with certain modifications, AUC can be made proper
Under certain circumstances, the expected AUC can be improved by quoting probabilities that differ from their true values
Abstract
Scoring functions are used to evaluate and compare partially probabilistic forecasts. We investigate the use of rank-sum functions such as empirical Area Under the Curve (AUC), a widely-used measure of classification performance, as a scoring function for the prediction of probabilities of a set of binary outcomes. It is shown that the AUC is not generally a proper scoring function, that is, under certain circumstances it is possible to improve on the expected AUC by modifying the quoted probabilities from their true values. However, with some restrictions, or with certain modifications, it can be made proper.
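To make the quantities in the abstract concrete, here is a minimal Python sketch of the empirical AUC written as a normalized rank-sum (Mann-Whitney) statistic, together with a helper that computes its expectation under a given joint distribution of the binary outcomes. This is an illustration, not the paper's own code. The function names, the half-credit treatment of ties, the `undefined_value` convention for outcome vectors containing only one class, and the independence assumption in `independent_joint` are all assumptions made here.

```python
from itertools import product


def empirical_auc(scores, outcomes):
    """Empirical AUC as a normalized rank-sum (Mann-Whitney) statistic.

    Over all (positive, negative) pairs, counts how often the positive case
    received the higher score; ties count one half (an assumed convention).
    Returns None when the outcomes are all 0 or all 1, where AUC is undefined.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_auc(quoted, joint, undefined_value=0.5):
    """Expected empirical AUC of the quoted probabilities.

    `joint` maps each outcome tuple, e.g. (0, 1, 1), to its probability.
    `undefined_value` is an assumed stand-in score for outcome vectors on
    which the AUC is undefined; it is not taken from the paper.
    """
    total = 0.0
    for outcomes, prob in joint.items():
        auc = empirical_auc(quoted, outcomes)
        total += prob * (undefined_value if auc is None else auc)
    return total


def independent_joint(probs):
    """Joint distribution of independent binary outcomes (an assumption)."""
    joint = {}
    for outcomes in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for p, y in zip(probs, outcomes):
            weight *= p if y == 1 else 1.0 - p
        joint[outcomes] = weight
    return joint
```

A scoring function is proper if quoting the true probabilities maximizes the expected score. With these helpers, one can compare `expected_auc(true_probs, joint)` against `expected_auc(other_quotes, joint)` for a chosen joint distribution to explore when that holds; the sketch itself makes no claim about which distributions produce a violation.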
