Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

Ariel Lubonja; Pedro R. A. S. Bassi; Wenxuan Li; Hualin Qiao; Randal Burns; Alan L. Yuille; Zongwei Zhou

arXiv:2512.19091 · cs.CV · December 23, 2025

Open Access

TL;DR

This paper introduces RankInsight, a toolkit that enhances medical AI challenge evaluations by incorporating statistical significance testing, organ-specific metrics, and demographic fairness analysis, leading to more reliable and equitable rankings.

Contribution

The paper presents RankInsight, an open-source toolkit that addresses three limitations of current medical AI leaderboards by adding pair-wise significance testing, organ-appropriate metrics, and intersectional demographic fairness analysis.

Findings

01

Pair-wise significance maps show that the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty.

02

Organ-appropriate metrics can reverse model rankings: replacing Dice with NSD for tubular structures reverses the order of the top four models.

03

Significant demographic fairness gaps are identified in current models.

Abstract

Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and…
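To make the idea of a pair-wise significance map concrete, here is a minimal sketch. The abstract does not say which statistical test or multiple-comparison correction RankInsight uses, so the paired sign-flip permutation test and the Holm correction below are assumptions chosen for illustration, not the toolkit's method. The function names (`paired_permutation_pvalue`, `significance_map`), the model labels, and the synthetic per-case scores are all hypothetical.

```python
import itertools
import random
from statistics import mean


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on per-case score differences.

    Assumption: both models were scored on the same cases, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def significance_map(per_case_scores, alpha=0.05):
    """Return {(model_a, model_b): (p_value, significant)} for every model pair.

    Significance is decided with the Holm step-down correction (an assumption).
    """
    pairs = list(itertools.combinations(sorted(per_case_scores), 2))
    pvals = {
        pair: paired_permutation_pvalue(per_case_scores[pair[0]], per_case_scores[pair[1]])
        for pair in pairs
    }
    ordered = sorted(pairs, key=lambda pair: pvals[pair])
    total = len(ordered)
    result = {}
    still_rejecting = True
    for i, pair in enumerate(ordered):
        # Holm: compare the i-th smallest p-value with alpha / (total - i),
        # and stop rejecting as soon as one comparison fails.
        still_rejecting = still_rejecting and pvals[pair] <= alpha / (total - i)
        result[pair] = (pvals[pair], still_rejecting)
    return result


# Synthetic, purely illustrative per-case scores (not results from the paper).
_rng = random.Random(42)
_base = [_rng.uniform(0.70, 0.95) for _ in range(40)]
toy_scores = {
    "model_a": _base,
    "model_b": [s - 0.05 + _rng.gauss(0, 0.01) for s in _base],  # consistently lower
    "model_c": [s + _rng.gauss(0, 0.01) for s in _base],  # close to model_a
}
smap = significance_map(toy_scores)
for (left, right), (p_value, significant) in sorted(smap.items()):
    print(f"{left} vs {right}: p={p_value:.4f} significant={significant}")
```

A map like this separates score gaps that are stable across cases from gaps that could plausibly come from case-to-case noise, which is the rank-stability question the abstract raises. The same comparison could be repeated per organ or per metric.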

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Machine Learning in Healthcare