Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F. Jaeger

TL;DR
This paper introduces a new evaluation metric, AUGRC, for selective classification systems that better captures overall performance across multiple thresholds, addressing limitations of existing metrics.
Contribution
The paper defines five key requirements for multi-threshold metrics in selective classification and proposes AUGRC, a new metric that fulfills these requirements and improves benchmarking.
Findings
AUGRC meets all five proposed requirements.
It substantially changes method rankings on 5 of the 6 benchmark datasets.
Empirical results demonstrate its effectiveness across diverse datasets.
Abstract
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the…
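To make the "average risk of undetected failures" reading concrete, the following is a minimal sketch of how such an area could be computed from per-sample confidence scores and correctness labels. It assumes the generalized risk at a given coverage is the fraction of all samples that are both accepted and misclassified, and it integrates that quantity over coverage with the trapezoidal rule. The function name `augrc`, the handling of tied confidences (ignored here), and the integration scheme are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np


def augrc(confidence, correct):
    """Sketch of an area under a generalized risk-coverage curve.

    Assumption: generalized risk at coverage k/n is the share of ALL n
    samples that are accepted (among the k most confident) and wrong.
    Tied confidence values are not treated specially in this sketch.
    """
    confidence = np.asarray(confidence, dtype=float)
    failures = 1.0 - np.asarray(correct, dtype=float)
    n = len(confidence)

    # Accept samples from most to least confident.
    order = np.argsort(-confidence, kind="stable")

    # Generalized risk after accepting 0, 1, ..., n samples.
    gen_risk = np.concatenate(([0.0], np.cumsum(failures[order]) / n))
    coverage = np.arange(n + 1) / n

    # Trapezoidal integration of generalized risk over coverage.
    widths = np.diff(coverage)
    heights = (gen_risk[1:] + gen_risk[:-1]) / 2.0
    return float(np.sum(heights * widths))


# Toy data: two correct and two wrong predictions.
well_ranked = augrc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])    # failures get low confidence
badly_ranked = augrc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])   # failures get high confidence
print(well_ranked, badly_ranked)
```

In this toy setting both classifiers have the same accuracy, but the one that assigns low confidence to its failures obtains the lower value, which is the behavior a multi-threshold metric for selective classification is meant to reward.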
Taxonomy
Topics: Statistical and Computational Modeling
