Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Jeremias Traub; Till J. Bungert; Carsten T. Lüth; Michael Baumgartner; Klaus H. Maier-Hein; Lena Maier-Hein; Paul F. Jaeger

arXiv:2407.01032·cs.LG·October 22, 2024

TL;DR

This paper introduces a new evaluation metric, AUGRC, for selective classification systems that better captures overall performance across multiple thresholds, addressing limitations of existing metrics.

Contribution

The paper defines five key requirements for multi-threshold metrics in selective classification and proposes AUGRC, a new metric that fulfills these requirements and improves benchmarking.

Findings

01

AUGRC meets all five proposed requirements.

02

Evaluating with AUGRC significantly alters performance rankings on most datasets.

03

Empirical results demonstrate AUGRC's effectiveness across diverse datasets.

Abstract

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $AUROC$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($AUGRC$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the…
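
The "average risk of undetected failures" reading can be illustrated with a small sketch. It assumes that the generalized risk at a given coverage is the fraction of all samples that are both accepted and misclassified, and that the metric is the area under that curve, integrated here with the trapezoidal rule. The function name `augrc` and its inputs are hypothetical, ties between confidence scores are not treated specially, and this is not the reference implementation from iml-dkfz/fd-shifts.

```python
import numpy as np


def augrc(confidences, failures):
    """Approximate area under a generalized risk-coverage curve.

    Assumption (not taken from the paper's code): the generalized risk at
    coverage k/n is (number of failures among the k most confident
    predictions) / n, i.e. the joint rate of "accepted and wrong".
    """
    confidences = np.asarray(confidences, dtype=float)
    failures = np.asarray(failures, dtype=float)  # 1 = misclassified, 0 = correct
    n = confidences.size

    # Accept predictions from most to least confident.
    order = np.argsort(-confidences, kind="stable")
    failures_sorted = failures[order]

    # Curve points after accepting 0, 1, ..., n samples.
    coverage = np.concatenate(([0.0], np.arange(1, n + 1) / n))
    gen_risk = np.concatenate(([0.0], np.cumsum(failures_sorted) / n))

    # Trapezoidal integration of generalized risk over coverage.
    widths = np.diff(coverage)
    heights = (gen_risk[1:] + gen_risk[:-1]) / 2.0
    return float(np.sum(widths * heights))


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6]
    # Failures receive the lowest confidence: they are rejected first.
    print(augrc(conf, [0, 0, 1, 1]))
    # Failures receive the highest confidence: they stay undetected longest.
    print(augrc(conf, [1, 1, 0, 0]))
```

Under these assumptions lower values are better: a confidence score that ranks failures last yields a smaller area than one that ranks them first, even though the classifier's accuracy is identical in both cases.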

Code & Models

Repositories

iml-dkfz/fd-shifts (PyTorch, official)

Taxonomy

Topics: Statistical and Computational Modeling