Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Meritxell Riera-Marin; Sikha O K; Julia Rodriguez-Comas; Matthias Stefan May; Zhaohong Pan; Xiang Zhou; Xiaokun Liang; Franciskus Xaverius Erick; Andrea Prenner; Cedric Hemon; Valentin Boussot; Jean-Louis Dillenseger; Jean-Claude Nunes; Abdul Qayyum; Moona Mazher; Steven A Niederer; Kaisar Kushibar; Carlos Martin-Isla; Petia Radeva; Karim Lekadir; Theodore Barfoot; Luis C. Garcia Peraza Herrera; Ben Glocker; Tom Vercauteren; Lucas Gago; Justin Englemann; Joy-Marie Kleiss; Anton Aubanell; Andreu Antolin; Javier Garcia-Lopez; Miguel A. Gonzalez Ballester; Adrian Galdran

arXiv:2505.08685·cs.CV·October 15, 2025

TL;DR

This paper introduces the CURVAS challenge, emphasizing the importance of multi-annotator ground truth, calibration, and uncertainty estimation in developing reliable deep learning models for multiorgan medical image segmentation.

Contribution

It presents a comprehensive challenge evaluating DL models on multi-annotator data, focusing on calibration and uncertainty, and demonstrates the benefits of diverse training data and pre-trained knowledge.

Findings

01. Better calibration correlates with higher segmentation quality.
02. Models trained on diverse datasets show increased robustness.
03. High-performing models achieved strong DSC and well-calibrated uncertainty estimates.

Abstract

Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge, which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as the Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and…
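To make the three metrics named in the abstract concrete, the following is a minimal NumPy sketch of one common textbook form of each. It is an illustration under stated assumptions, not the challenge's evaluation code: the function names (`dice_score`, `expected_calibration_error`, `crps_from_samples`), the binary single-organ setting, the equal-width binning of foreground probabilities, and the sample-based CRPS estimator are all choices made here, and the official CURVAS definitions (for example, how multi-annotator consensus or organ volumes enter each metric) may differ.

```python
import numpy as np


def dice_score(pred, target):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin the predicted foreground probabilities and compare,
    per bin, the mean probability with the observed foreground fraction."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        gap = abs(labels[mask].mean() - probs[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of voxels in the bin
    return float(ece)


def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float).ravel()
    term_obs = np.abs(samples - observation).mean()
    term_spread = np.abs(samples[:, None] - samples[None, :]).mean()
    return float(term_obs - 0.5 * term_spread)
```

Under these assumptions, `dice_score` and `expected_calibration_error` would be applied per organ to a predicted mask or probability map, while `crps_from_samples` would score a set of predicted volumes (for instance from several stochastic forward passes, a hypothetical setup) against a reference volume. Lower ECE and CRPS and higher DSC indicate better performance.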
