Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges
Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns, Alan L. Yuille, Zongwei Zhou

TL;DR
This paper introduces RankInsight, a toolkit that enhances medical AI challenge evaluations by incorporating statistical significance testing, organ-specific metrics, and demographic fairness analysis, leading to more reliable and equitable rankings.
Contribution
The paper presents RankInsight, a comprehensive toolkit that addresses limitations in current medical AI leaderboards by adding significance testing, organ-specific metrics, and intersectional fairness analysis.
Findings
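The nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty in pair-wise significance tests.
Organ-appropriate metrics can reverse model rankings: replacing Dice with NSD for tubular structures reverses the order of the top four models.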
Significant demographic fairness gaps are identified in current models.
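To make the metric-choice finding concrete, the sketch below contrasts Dice with a simplified surface-based score on a thin, tube-like structure. It is an illustration under stated assumptions, not RankInsight's implementation: masks are hypothetical sets of 2D pixel coordinates, the surface is taken as the 4-connected boundary, and `nsd` follows the common definition of normalized surface distance (the fraction of boundary pixels of each mask lying within a tolerance of the other mask's boundary). The function names and the tolerance values are chosen for this example only.

```python
from math import hypot


def dice(a, b):
    """Dice overlap between two masks given as sets of (row, col) pixels."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def boundary(mask):
    """Pixels of the mask with at least one 4-neighbour outside the mask."""
    nbrs = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (r, c)
        for r, c in mask
        if any((r + dr, c + dc) not in mask for dr, dc in nbrs)
    }


def nsd(a, b, tol):
    """Simplified normalized surface distance with tolerance `tol` (in pixels)."""
    sa, sb = boundary(a), boundary(b)
    if not sa and not sb:
        return 1.0
    if not sa or not sb:
        return 0.0

    def close(src, dst):
        # Count boundary pixels of `src` within `tol` of the nearest `dst` pixel.
        return sum(
            min(hypot(r - r2, c - c2) for r2, c2 in dst) <= tol
            for r, c in src
        )

    return (close(sa, sb) + close(sb, sa)) / (len(sa) + len(sb))


# Hypothetical one-pixel-wide "vessel" and a prediction shifted by one row.
gt = {(5, c) for c in range(10)}
pred = {(6, c) for c in range(10)}

print(dice(gt, pred))      # no overlapping pixels
print(nsd(gt, pred, 1.0))  # every boundary pixel is within one pixel
```

On this toy case the shifted prediction shares no pixels with the reference, so Dice is zero, while every boundary pixel lies within the one-pixel tolerance, so the surface-based score is perfect. Disagreement of this kind on thin structures is one way a change of metric can reorder a leaderboard.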
Abstract
Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and…
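A pair-wise significance map of the kind the abstract describes can be sketched as follows. The abstract does not state which test or correction RankInsight uses, so this example makes its own assumptions: an exact two-sided sign test on paired per-case scores, with Holm correction across all model pairs. The model names and scores are invented for illustration.

```python
from itertools import combinations
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test on paired per-case scores (ties dropped)."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def significance_map(scores, alpha=0.05):
    """Map each model pair to (Holm-adjusted p-value, significant at alpha)."""
    pairs = list(combinations(sorted(scores), 2))
    raw = {p: sign_test_p(scores[p[0]], scores[p[1]]) for p in pairs}
    ordered = sorted(pairs, key=raw.get)  # smallest raw p-value first
    m = len(ordered)
    result = {}
    running_max = 0.0
    for rank, pair in enumerate(ordered):
        adjusted = min(1.0, (m - rank) * raw[pair])
        running_max = max(running_max, adjusted)  # keep adjusted p monotone
        result[pair] = (running_max, running_max < alpha)
    return result


# Hypothetical per-case Dice scores for three models on the same ten cases.
scores = {
    "A": [0.9] * 10,
    "B": [0.8] * 10,
    "C": [0.8] * 5 + [0.95] * 5,
}
for pair, (p_adj, significant) in significance_map(scores).items():
    print(pair, round(p_adj, 4), significant)
```

In this toy data only the A versus B gap survives correction; A and C trade wins case by case, so their ordering on a mean-score leaderboard would not be statistically supported.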
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Machine Learning in Healthcare
