Trust, or Don't Predict: Introducing the CWSA Family for Confidence-Aware Model Evaluation

Kourosh Shahnazari; Seyed Moein Ayyoubzadeh; Mohammadali Keshtparvar; Pegah Ghaffari

arXiv:2505.18622·cs.LG·May 27, 2025

Open Access

TL;DR

This paper introduces the CWSA and CWSA+ metrics for evaluating confidence-aware models, explicitly rewarding correct confident predictions and penalizing overconfident errors, improving trust assessment in machine learning systems.

Contribution

The paper proposes two novel, interpretable metrics, CWSA and CWSA+, that evaluate model reliability under confidence thresholds better than traditional metrics such as accuracy, ECE, and AURC.

Findings

01. CWSA and CWSA+ outperform classical metrics in detecting failure modes.

02. Metrics effectively distinguish between calibrated, overconfident, and underconfident models.

03. CWSA provides a reliable basis for safety-critical model evaluation.

Abstract

Modern machine learning systems increasingly use confidence scores to drive selective prediction, in which a model abstains from predicting when its confidence is low. Yet conventional metrics such as accuracy, expected calibration error (ECE), and area under the risk-coverage curve (AURC) do not capture the actual reliability of predictions. These metrics either disregard confidence entirely, dilute valuable localized information through averaging, or fail to adequately penalize overconfident misclassifications, which can be particularly detrimental in real-world systems. We introduce two new metrics, Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+, that offer a principled and interpretable way to evaluate predictive models under confidence thresholds. Unlike existing methods, our metrics explicitly reward confident accuracy and…
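To make the idea in the abstract concrete, the following is a minimal Python sketch of a confidence-weighted score under a threshold. It is an illustration only and should not be read as the paper's definition of CWSA or CWSA+, which is not given in this excerpt. The function name `confidence_weighted_score`, the linear weight `(c - tau) / (1 - tau)`, and the +1/-1 reward scheme are all assumptions chosen to show how such a score can reward confident correct predictions, penalize confident errors, and ignore predictions the model abstains on.

```python
from typing import Sequence


def confidence_weighted_score(
    confidences: Sequence[float],
    correct: Sequence[bool],
    tau: float,
) -> float:
    """Hypothetical confidence-weighted score (NOT the paper's formula).

    Predictions with confidence below `tau` are treated as abstentions.
    Each retained prediction contributes +w if correct and -w if wrong,
    where w = (c - tau) / (1 - tau) grows linearly from 0 to 1.
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")

    total = 0.0
    kept = 0
    for c, ok in zip(confidences, correct):
        if c < tau:
            continue  # the model abstains on this example
        weight = (c - tau) / (1.0 - tau)
        total += weight if ok else -weight
        kept += 1

    # Assumed convention: score is 0.0 when every prediction is an abstention.
    return total / kept if kept else 0.0


# Two toy models with identical accuracy (2 of 3 correct) but different
# confidence behaviour. Model A is confident when right; model B is
# confident when wrong.
model_a = confidence_weighted_score([0.95, 0.90, 0.60], [True, True, False], tau=0.5)
model_b = confidence_weighted_score([0.60, 0.60, 0.95], [True, True, False], tau=0.5)
print(model_a, model_b)
```

In this toy setup plain accuracy cannot separate the two models, while the weighted score ranks the model whose confidence tracks correctness above the one that is overconfident on its error. That is the kind of distinction the abstract says accuracy alone misses.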

Taxonomy

Topics: Scientific Computing and Data Management · Data Quality and Management · Anomaly Detection Techniques and Applications