Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

Ben Van Calster; Gary S. Collins; Andrew J. Vickers; Laure Wynants; Kathleen F. Kerr; Lasai Barreñada; Gael Varoquaux; Karandeep Singh; Karel G. M. Moons; Tina Hernandez-Boussard; Dirk Timmerman; David J. McLernon; Maarten van Smeden; Ewout W. Steyerberg (topic group 6 of the STRATOS initiative)

arXiv:2412.10288·cs.LG·December 16, 2024

TL;DR

This paper reviews performance measures for predictive AI models with a binary outcome in medicine and gives guidance on choosing among them, emphasizing proper measures and graphical assessments so that models used for clinical decision support are safe and effective.

Contribution

The paper evaluates 32 performance measures across five domains (discrimination, calibration, overall, classification, and clinical utility), describes their properties, and recommends key measures and plots for validating medical AI models.

Findings

1. 17 measures are both proper and decision-analytic

2. Classification accuracy and F1 are improper for clinical thresholds

3. Recommended measures include AUROC, calibration plot, and net benefit
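
Two of the recommended measures named above, AUROC and net benefit, can be illustrated with a minimal Python sketch. This is not code from the paper: the function names (`auroc`, `net_benefit`, `net_benefit_treat_all`) and the toy data are assumptions made up for illustration. The sketch uses the standard definitions: AUROC as the proportion of event/non-event pairs in which the event has the higher predicted risk (ties count one half), and net benefit at decision threshold t as TP/n - (FP/n) * t/(1 - t).

```python
def auroc(y_true, y_prob):
    """Share of (event, non-event) pairs where the event has the higher risk.

    Ties contribute 0.5. Illustrative O(n^2) implementation.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                wins += 1.0
            elif p_event == p_nonevent:
                wins += 0.5
    return wins / (len(events) * len(nonevents))


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default strategy of treating every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data: observed outcomes and predicted risks.
Y_TOY = [0, 0, 1, 0, 1, 1, 0, 1]
P_TOY = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

print("AUROC:", auroc(Y_TOY, P_TOY))
for t in (0.2, 0.5):
    print(
        "threshold", t,
        "model:", net_benefit(Y_TOY, P_TOY, t),
        "treat all:", net_benefit_treat_all(Y_TOY, t),
    )
```

A model's net benefit is usually compared with the "treat all" strategy and with "treat none" (net benefit 0) over a range of clinically reasonable thresholds, which is what a decision curve plots. Calibration assessment needs a smoothed or grouped comparison of predicted and observed risks and is not covered by this sketch.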

Abstract

A myriad of measures to illustrate performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when…
