Evaluating AI systems under uncertain ground truth: a case study in dermatology
David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias

TL;DR
This paper highlights the importance of accounting for uncertainty in ground truth when evaluating medical AI systems, proposing a statistical method to better estimate true performance and risk.
Contribution
It introduces a novel aggregation approach that models ground truth uncertainty from multiple expert annotations, improving evaluation accuracy in medical diagnosis AI.
Findings
Standard evaluation, which collapses expert annotations into a single label and ignores their uncertainty, overestimates model performance.
Ground truth uncertainty significantly affects performance metrics in dermatology datasets.
The proposed method yields more realistic performance estimates and quantifies the uncertainty around them.
Abstract
For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth which is assumed to be fixed and certain. However, this ground truth is often curated in the form of differential diagnoses. While a single differential diagnosis reflects the uncertainty in one expert assessment, multiple experts introduce another layer of uncertainty through disagreement. Both forms of uncertainty are ignored in standard evaluation which aggregates these differential diagnoses to a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, therefore underestimating risk associated with particular diagnostic decisions. To this end, we propose a statistical aggregation approach, where we infer a distribution on probabilities of underlying medical condition candidates themselves,…
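The contrast the abstract draws, between scoring against one aggregated label and scoring against a distribution over plausible ground truths, can be illustrated with a small sketch. This is not the authors' method: the abstract is cut off before the details, so the inverse-rank weighting, the bootstrap over annotators, and all names and conditions below (aggregate, top1, bootstrap_accuracy, the toy annotations) are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def aggregate(annotations):
    """Turn ranked differential diagnoses into one plausibility distribution.

    Assumption for illustration: a condition at rank r (0-based) in one
    expert's differential gets weight 1 / (r + 1); weights are summed over
    experts and normalized to sum to 1.
    """
    totals = Counter()
    for ranking in annotations:
        for rank, condition in enumerate(ranking):
            totals[condition] += 1.0 / (rank + 1)
    norm = sum(totals.values())
    return {c: w / norm for c, w in totals.items()}


def top1(plausibility):
    """Most plausible condition; ties are broken alphabetically."""
    return min(plausibility, key=lambda c: (-plausibility[c], c))


def bootstrap_accuracy(prediction, annotations, n_samples=1000, seed=0):
    """Fraction of resampled expert panels whose top-1 label matches.

    Resampling experts with replacement is a simple stand-in for a
    distribution over plausible ground truths; it is not the statistical
    model proposed in the paper.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        panel = [rng.choice(annotations) for _ in annotations]
        hits += top1(aggregate(panel)) == prediction
    return hits / n_samples


# Hypothetical case: three experts, each giving a ranked differential.
annotations = [
    ["eczema", "psoriasis"],
    ["psoriasis"],
    ["eczema", "tinea"],
]
prediction = "eczema"

# Standard evaluation: compare against the single aggregated label.
point_label = top1(aggregate(annotations))
print("single-label evaluation correct:", prediction == point_label)

# Uncertainty-aware evaluation: how often the prediction stays correct
# when the expert panel is resampled.
print("bootstrap accuracy:", bootstrap_accuracy(prediction, annotations))
```

In this toy case the single-label evaluation counts the prediction as fully correct, while the resampled estimate falls below 1 because some plausible expert panels favour psoriasis. That gap is the kind of over-optimism the abstract describes.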
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Advanced Statistical Methods and Models
Methods: Invertible Rescaling Network
