Evaluating AI systems under uncertain ground truth: a case study in dermatology

David Stutz; Ali Taylan Cemgil; Abhijit Guha Roy; Tatiana Matejovicova; Melih Barsbey; Patricia Strachan; Mike Schaekermann; Jan Freyberg; Rajeev Rikhye; Beverly Freeman; Javier Perez Matos; Umesh Telang; Dale R. Webster; Yuan Liu; Greg S. Corrado; Yossi Matias; Pushmeet Kohli; Yun Liu; Arnaud Doucet; Alan Karthikesalingam

arXiv:2307.02191·cs.LG·April 15, 2025·1 cites

Open Access · 1 Repo

TL;DR

This paper highlights the importance of accounting for uncertainty in ground truth when evaluating medical AI systems, proposing a statistical method to better estimate true performance and risk.

Contribution

It introduces a novel aggregation approach that models ground truth uncertainty from multiple expert annotations, improving evaluation accuracy in medical diagnosis AI.

Findings

01. Standard evaluation overestimates model performance when ground truth uncertainty is ignored.

02. Ground truth uncertainty significantly affects performance metrics on dermatology datasets.

03. The proposed method yields more realistic performance estimates, together with a quantification of their uncertainty.

Abstract

For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth which is assumed to be fixed and certain. However, this ground truth is often curated in the form of differential diagnoses. While a single differential diagnosis reflects the uncertainty in one expert assessment, multiple experts introduce another layer of uncertainty through disagreement. Both forms of uncertainty are ignored in standard evaluation which aggregates these differential diagnoses to a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, therefore underestimating risk associated with particular diagnostic decisions. To this end, we propose a statistical aggregation approach, where we infer a distribution on probabilities of underlying medical condition candidates themselves,…
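The gap between evaluating against a single aggregated label and evaluating against the experts' full opinions can be seen in a toy sketch. Everything below is an assumption made for illustration: the inverse-rank weighting, the function names, and the two example cases are hypothetical, and the sketch does not reproduce the statistical aggregation approach of the paper or the code in its repository.

```python
from collections import defaultdict


def inverse_rank_plausibilities(differentials):
    """Turn ranked differential diagnoses into a distribution over conditions.

    Assumed scheme (illustrative only): each expert gives weight 1/rank to
    every condition they list, per-expert weights are normalized to sum to 1,
    and the result is averaged over experts.
    """
    totals = defaultdict(float)
    for ranking in differentials:
        weights = {cond: 1.0 / (rank + 1) for rank, cond in enumerate(ranking)}
        norm = sum(weights.values())
        for cond, weight in weights.items():
            totals[cond] += weight / norm
    return {cond: w / len(differentials) for cond, w in totals.items()}


def standard_accuracy(predictions, cases):
    """Top-1 accuracy against the single most plausible label per case."""
    hits = 0
    for pred, differentials in zip(predictions, cases):
        plaus = inverse_rank_plausibilities(differentials)
        hits += pred == max(plaus, key=plaus.get)
    return hits / len(cases)


def expected_accuracy(predictions, cases):
    """Average plausibility mass the experts assign to each prediction."""
    scores = [
        inverse_rank_plausibilities(differentials).get(pred, 0.0)
        for pred, differentials in zip(predictions, cases)
    ]
    return sum(scores) / len(scores)


# Hypothetical data: two cases, three experts each, ranked most likely first.
cases = [
    [["eczema", "psoriasis"], ["eczema"], ["psoriasis", "eczema"]],
    [["melanoma"], ["nevus", "melanoma"], ["nevus"]],
]
predictions = ["eczema", "nevus"]

print(f"standard: {standard_accuracy(predictions, cases):.3f}")  # standard: 1.000
print(f"expected: {expected_accuracy(predictions, cases):.3f}")  # expected: 0.611
```

In this toy data the model looks perfect against the aggregated labels, yet the experts place only about 61% of their combined weight on its predictions. Disagreement of this kind is discarded once each case is collapsed to one label.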

Code & Models

Repositories

google-deepmind/uncertain_ground_truth · jax · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Advanced Statistical Methods and Models

Methods: Invertible Rescaling Network