Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals

Pascaline André (1), Charles Heitz (1), Evangelia Christodoulou (2, 5, 6), Annika Reinke (2, 4), Carole H. Sudre (3, 7, 8), Michela Antonelli (7, 8), Patrick Godau (2, 5), M. Jorge Cardoso (7), Antoine Gilson (1), Sophie Tezenas du Montcel (1), Gaël Varoquaux (9), Lena Maier-Hein (2, 4, 5, 10, 11), Olivier Colliot (1) ((1) Sorbonne Université; Institut du Cerveau - Paris Brain Institute - ICM; CNRS; Inria; Inserm; AP-HP; Hôpital de la Pitié-Salpêtrière; F-75013; Paris; France (2) German Cancer Research Center (DKFZ) Heidelberg; Division of Intelligent Medical Systems; Germany (3) Unit for Lifelong Health; Ageing at UCL; Department of Population Science; Experimental Medicine; Hawkes Institute; Centre for Medical Image Computing; Department of Computer Science; University College London; UK (4) DKFZ Heidelberg; Helmholtz Imaging; Germany (5) National Center for Tumor Diseases (NCT); NCT Heidelberg; a partnership between DKFZ; Heidelberg University Hospital; Germany (6) AI Health Innovation Cluster; Germany (7) School of Biomedical Engineering; Imaging Science; King's College London; UK (8) Hawkes Institute; Department of Computer Science; University College London; UK (9) SODA project team; Inria Saclay-Île-de-France; France (10) Faculty of Mathematics; Computer Science; Heidelberg University; Germany (11) Medical Faculty; Heidelberg University; Germany)

arXiv:2601.17103·cs.CV·January 27, 2026

Open Access

TL;DR

This large-scale study evaluates the reliability (coverage) and precision (width) of several confidence interval methods across diverse medical imaging tasks, and identifies the factors that most affect performance uncertainty quantification, which is essential for validating clinical AI.

Contribution

The paper provides a comprehensive empirical analysis of how confidence intervals behave in medical imaging AI, highlights the factors that affect their reliability, and offers guidance for future reporting standards.

Findings

01. Sample size requirements vary widely depending on study parameters.

02. Performance metric choice significantly impacts CI behavior.

03. Aggregation strategies influence CI reliability, especially for macro vs. micro metrics.
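To make the macro vs. micro distinction in the third finding concrete, here is a minimal sketch in Python. This page does not define the two aggregation strategies, so the sketch assumes the common classification definitions: macro averages a per-class metric with equal weight per class, and micro pools all samples before computing the metric. The function name `macro_micro_recall` and the toy labels are illustrative and not taken from the paper.

```python
import numpy as np

def macro_micro_recall(y_true, y_pred, n_classes):
    """Return (macro, micro) recall for single-label classification.

    Assumed definitions (not taken from the paper):
      macro = unweighted mean of per-class recall
      micro = recall computed on pooled samples (equals accuracy here)
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():  # skip classes absent from the test set
            per_class.append(float((y_pred[mask] == c).mean()))
    macro = float(np.mean(per_class))
    micro = float((y_true == y_pred).mean())
    return macro, micro

# Toy imbalanced test set: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one minority-class sample is misclassified
macro, micro = macro_micro_recall(y_true, y_pred, n_classes=2)
print(macro, micro)  # 0.75 0.9
```

In this toy case the macro value depends heavily on the two minority-class samples, while the micro value is dominated by the majority class. This illustrates why the two aggregates can carry different amounts of uncertainty on the same test set.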

Abstract

Performance uncertainty quantification is essential for reliable validation and eventual clinical translation of medical imaging artificial intelligence (AI). Confidence intervals (CIs) play a central role in this process by indicating how precise a reported performance estimate is. Yet, due to the limited amount of work examining CI behavior in medical imaging, the community remains largely unaware of how many diverse CI methods exist and how they behave in specific settings. The purpose of this study is to close this gap. To this end, we conducted a large-scale empirical analysis across a total of 24 segmentation and classification tasks, using 19 trained models per task group, a broad spectrum of commonly used performance metrics, multiple aggregation strategies, and several widely adopted CI methods. Reliability (coverage) and precision (width) of each CI method were estimated…
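The abstract's two criteria, coverage and width, can be illustrated with a small simulation. The sketch below is not the authors' protocol, which is truncated here. It assumes a simple setting in which per-sample correctness is Bernoulli with a known true accuracy, and it uses the percentile bootstrap as one example of a CI method. All function names and parameter values are hypothetical.

```python
import numpy as np

def percentile_bootstrap_ci(values, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the mean of per-sample scores."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    n = values.size
    # Resample test cases with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def estimate_coverage_and_width(true_acc=0.8, n=100, n_trials=500, seed=0):
    """Estimate coverage and mean width of the CI by repeated sampling.

    Assumption: each test sample is correct with probability true_acc,
    so the true accuracy is known and coverage can be checked directly.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    widths = []
    for _ in range(n_trials):
        correct = rng.binomial(1, true_acc, size=n)  # simulated test set
        lo, hi = percentile_bootstrap_ci(correct, rng=rng)
        hits += int(lo <= true_acc <= hi)  # does the CI contain the truth?
        widths.append(hi - lo)
    return hits / n_trials, float(np.mean(widths))

coverage, width = estimate_coverage_and_width()
print(f"coverage={coverage:.3f}, mean width={width:.3f}")
```

A reliable 95% CI method should yield coverage close to 0.95. Among methods with adequate coverage, a narrower mean width indicates a more precise estimate.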

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Radiology practices and education