Label-free estimation of clinically relevant performance metrics under distribution shifts
Tim Flühmann, Alceu Bissoto, Trung-Dung Hoang, Lisa M. Koch

TL;DR
This paper introduces methods to estimate full confusion matrices for medical image classifiers without labels, evaluates them on chest X-ray data under distribution shifts, and highlights their reliability and limitations in clinical settings.
Contribution
It generalizes existing performance prediction methods to estimate full confusion matrices and benchmarks their effectiveness on real-world and simulated distribution shifts in medical imaging.
Findings
Confusion matrix estimation reliably predicts clinical metrics under shifts.
Current methods have failure modes revealed by simulated shifts.
Performance monitoring techniques need better understanding for clinical deployment.
Abstract
Performance monitoring is essential for safe clinical deployment of image classification models. However, because ground-truth labels are typically unavailable in the target dataset, direct assessment of real-world model performance is infeasible. State-of-the-art performance estimation methods address this by leveraging confidence scores to estimate the target accuracy. Despite being a promising direction, the established methods mainly estimate the model's accuracy and are rarely evaluated in a clinical domain, where strong class imbalances and dataset shifts are common. Our contributions are twofold: First, we introduce generalisations of existing performance prediction methods that directly estimate the full confusion matrix. Then, we benchmark their performance on chest x-ray data in real-world distribution shifts as well as simulated covariate and prevalence shifts. The proposed…
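Example (illustrative sketch, not the authors' implementation)
To make the idea of estimating a confusion matrix from confidence scores concrete, the sketch below shows one simple way it could be done for a binary classifier. It assumes the model's predicted probability for the positive class is well calibrated on the target data, an assumption that may not hold under distribution shift. The function names, the 0.5 threshold, and the input format are hypothetical choices for illustration and are not taken from the paper.

```python
def estimate_confusion_matrix(p_pos, threshold=0.5):
    """Estimate expected confusion-matrix counts without labels.

    p_pos: predicted probabilities of the positive class (assumed calibrated).
    Each sample contributes fractionally: a sample predicted positive with
    probability p adds p to the true positives and (1 - p) to the false positives.
    """
    tp = fp = fn = tn = 0.0
    for p in p_pos:
        if p >= threshold:      # predicted positive
            tp += p
            fp += 1.0 - p
        else:                   # predicted negative
            fn += p
            tn += 1.0 - p
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def sensitivity_specificity(cm):
    """Derive sensitivity and specificity from estimated counts."""
    pos = cm["tp"] + cm["fn"]   # estimated number of true positives cases
    neg = cm["tn"] + cm["fp"]   # estimated number of true negative cases
    sens = cm["tp"] / pos if pos > 0 else float("nan")
    spec = cm["tn"] / neg if neg > 0 else float("nan")
    return sens, spec


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.2, 0.1]   # made-up confidence scores
    cm = estimate_confusion_matrix(scores)
    print(cm, sensitivity_specificity(cm))
```

Because every clinically relevant metric here is computed from the four estimated counts, the same estimate can feed sensitivity, specificity, or predictive values; the quality of all of them depends on how well calibration survives the shift.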
