Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification
Jing Li

TL;DR
This study shows that the Area Under the ROC Curve (AUC) provides the most consistent evaluation of binary classification models across different data prevalences, outperforming other metrics in stability and ranking consistency.
Contribution
It demonstrates that AUC, which considers all decision thresholds, offers the most stable and reliable evaluation metric across varying data prevalences in binary classification.
Findings
AUC has the smallest variance in evaluating models across prevalence changes.
Metrics less influenced by prevalence provide more consistent model rankings.
Considering all decision thresholds reduces evaluation variance.
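The prevalence effect behind these findings can be illustrated with a small simulation. This is a minimal sketch and not the study's experimental setup. It assumes a hypothetical scorer whose outputs are Gaussian within each class, a fixed sample size, and an arbitrary decision threshold of 0.5 for precision. Only the prevalence changes between runs. The helper names (`auc`, `precision`, `simulate`) are illustrative.

```python
import random

def auc(scores, labels):
    """AUC via the Mann-Whitney rank-sum identity (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Sum of 1-based ranks held by the positive examples.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision(scores, labels, threshold=0.5):
    """Precision at a single, fixed decision threshold (assumed value)."""
    predicted_pos = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(predicted_pos) / len(predicted_pos)

def simulate(prevalence, n=4000, seed=0):
    """Hypothetical scorer: positives ~ N(1, 1), negatives ~ N(0, 1).

    The class-conditional score distributions and n stay fixed;
    only the share of positives (prevalence) varies.
    """
    rng = random.Random(seed)
    n_pos = int(round(prevalence * n))
    labels = [1] * n_pos + [0] * (n - n_pos)
    scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
    return scores, labels

prevalences = [0.1, 0.3, 0.5]
auc_values, precision_values = [], []
for p in prevalences:
    scores, labels = simulate(p)
    auc_values.append(auc(scores, labels))
    precision_values.append(precision(scores, labels))

# Spread (max - min) of each metric across the three prevalences.
auc_spread = max(auc_values) - min(auc_values)
precision_spread = max(precision_values) - min(precision_values)
print("AUC by prevalence:      ", [round(v, 3) for v in auc_values])
print("Precision by prevalence:", [round(v, 3) for v in precision_values])
print("AUC spread:", round(auc_spread, 3), "| precision spread:", round(precision_spread, 3))
```

Under these assumptions, AUC depends only on how positives rank against negatives, so it should change little as prevalence changes, while precision at a fixed threshold shifts with the class mix.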
Abstract
The proper use of evaluation metrics is important for model evaluation and model selection in binary classification tasks. This study investigates how consistently different metrics evaluate models across data of different prevalence while the relationships between variables and the sample size are kept constant. Analyzing 156 data scenarios, 18 model evaluation metrics, and five commonly used machine learning models as well as a naive random guess model, I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models. In particular, the Area Under the ROC Curve (AUC), which takes all decision thresholds into account when evaluating models, has the smallest variance in evaluating individual models and the smallest variance in ranking a set of models. A close…
Taxonomy
Topics: Artificial Intelligence in Healthcare · Imbalanced Data Classification Techniques
Methods: Sparse Evolutionary Training
