Beyond AUROC & co. for evaluating out-of-distribution detection performance
Galadrielle Humblot-Renaux, Sergio Escalera, Thomas B. Moeslund

TL;DR
This paper critiques the standard metrics for evaluating out-of-distribution (OOD) detection, such as AUROC, and introduces a new metric, the Area Under the Threshold Curve (AUTC), which penalizes poor separation between in-distribution (ID) and OOD samples.
Contribution
The paper proposes the Area Under the Threshold Curve (AUTC), a novel evaluation metric that addresses the shortcomings of existing metrics for OOD detection.
Findings
Current metrics such as AUROC treat OOD detection purely as a binary classification task and give little consideration to the detection threshold.
AUTC explicitly penalizes poor separation between ID and OOD samples.
AUTC is proposed as a basis for comparing OOD detection methods that is more consistent with practical needs.
Abstract
While there has been a growing research interest in developing out-of-distribution (OOD) detection methods, there has been comparably little discussion around how these methods should be evaluated. Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical needs. In this work, we take a closer look at the go-to metrics for evaluating OOD detection, and question the approach of exclusively reducing OOD detection to a binary classification task with little consideration for the detection threshold. We illustrate the limitations of current metrics (AUROC & its friends) and propose a new metric - Area Under the Threshold Curve (AUTC), which explicitly penalizes poor separation between ID and OOD samples. Scripts and data are available at https://github.com/glhr/beyond-auroc
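To make the abstract's point about thresholds concrete, here is a minimal sketch in plain Python. It relies on a known property of AUROC: it depends only on how scores are ranked, not on how far apart the ID and OOD scores are. The function names `auroc` and `threshold_curve_area` are hypothetical, and the threshold-based quantity is an assumption about how a metric like AUTC could be computed (the mean of the areas under the false positive rate and false negative rate curves over thresholds in [0, 1]). This summary does not give the formula, so consult the authors' repository for their actual implementation.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores above a random ID sample.

    Ties count as 0.5. Higher is better; only the ranking of scores matters.
    """
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def threshold_curve_area(id_scores, ood_scores, steps=1000):
    """Assumed AUTC-style quantity: mean area under FPR(t) and FNR(t), t in [0, 1].

    Convention assumed here: scores lie in [0, 1] and a sample is flagged
    as OOD when its score is >= t. Lower is better.
    """
    def area(curve):
        # Trapezoidal rule on a uniform grid over [0, 1].
        return sum((curve[k] + curve[k + 1]) / 2 for k in range(steps)) / steps

    thresholds = [k / steps for k in range(steps + 1)]
    # FPR: fraction of ID samples wrongly flagged as OOD at threshold t.
    fpr = [sum(s >= t for s in id_scores) / len(id_scores) for t in thresholds]
    # FNR: fraction of OOD samples missed at threshold t.
    fnr = [sum(s < t for s in ood_scores) / len(ood_scores) for t in thresholds]
    return (area(fpr) + area(fnr)) / 2


# Two hypothetical detectors with identical rankings but different separation.
well_separated = ([0.05, 0.10, 0.15], [0.85, 0.90, 0.95])
barely_separated = ([0.45, 0.47, 0.49], [0.51, 0.53, 0.55])

for name, (id_s, ood_s) in [("well separated", well_separated),
                            ("barely separated", barely_separated)]:
    print(name, auroc(id_s, ood_s), round(threshold_curve_area(id_s, ood_s), 2))
```

Both toy detectors reach an AUROC of 1.0, yet the threshold-based area is about 0.10 for the well-separated scores and about 0.47 for the barely separated ones. In the second case almost any threshold other than one near 0.5 misclassifies samples, which is the kind of behavior a ranking-only metric cannot reveal.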
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Data-Driven Disease Surveillance · Air Quality Monitoring and Forecasting
