TL;DR
This paper introduces a calibration method for precision-based metrics such as F1-score and AUC-PR that makes them invariant to the class prior, so that model performance can be compared meaningfully across subpopulations and time periods.
Contribution
It proposes a calibration of precision-based metrics that removes their dependence on the class prior, making them easier to interpret in real-world evaluation settings such as production monitoring, reporting, and fairness evaluation.
Findings
Calibrated metrics are less dependent on class prior.
Improved interpretability of model performance over subpopulations.
Enhanced control over what is measured in model evaluation.
Abstract
Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Curve of Precision Recall). Heavily dependent on the class prior, such metrics make it difficult to interpret the variation of a model's performance over different subpopulations/subperiods in a dataset. In this paper, we propose a way to calibrate the metrics so that they can be made invariant to the prior. We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use-cases where calibration is beneficial, such as model monitoring in production, reporting, or fairness evaluation.
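To make the prior dependence concrete, here is a minimal sketch of one way to calibrate precision. It is not taken from the paper's code. It relies only on the identity precision = π·TPR / (π·TPR + (1 − π)·FPR), where π is the positive class prior, and substitutes a fixed reference prior for π. The function names and the `reference_prior` parameter (defaulting to 0.5) are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
def precision(tp: float, fp: float) -> float:
    """Standard precision: TP / (TP + FP). Assumes tp + fp > 0."""
    return tp / (tp + fp)


def calibrated_precision(
    tp: float, fp: float, prior: float, reference_prior: float = 0.5
) -> float:
    """Precision re-expressed as if the positive prior were `reference_prior`.

    `prior` is the observed fraction of positives in the evaluated data.
    Both priors must lie strictly between 0 and 1.
    """
    if not (0.0 < prior < 1.0 and 0.0 < reference_prior < 1.0):
        raise ValueError("priors must be strictly between 0 and 1")
    # Rescale false positives so the negative/positive ratio matches the
    # reference prior instead of the observed one.
    weight = (prior * (1.0 - reference_prior)) / (reference_prior * (1.0 - prior))
    return tp / (tp + weight * fp)


def calibrated_f1(
    tp: float, fp: float, fn: float, prior: float, reference_prior: float = 0.5
) -> float:
    """F1 computed from calibrated precision and (prior-independent) recall."""
    p = calibrated_precision(tp, fp, prior, reference_prior)
    r = tp / (tp + fn)
    return 2.0 * p * r / (p + r)


# Same classifier (TPR = 0.8, FPR = 0.1) on two subpopulations of 1000 samples.
# Subpopulation A: prior 0.5 -> 500 positives, 500 negatives.
tp_a, fp_a, fn_a, prior_a = 400, 50, 100, 0.5
# Subpopulation B: prior 0.1 -> 100 positives, 900 negatives.
tp_b, fp_b, fn_b, prior_b = 80, 90, 20, 0.1

print("plain precision:     ", precision(tp_a, fp_a), precision(tp_b, fp_b))
print(
    "calibrated precision:",
    calibrated_precision(tp_a, fp_a, prior_a),
    calibrated_precision(tp_b, fp_b, prior_b),
)
```

In this toy setup the classifier behaves identically on both subpopulations, yet plain precision drops sharply on the imbalanced one, while the calibrated value is the same for both. That is the sense in which a calibrated metric isolates the model's behavior from the class prior.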
Taxonomy
Methods, Interpretability
