Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

Kirscher Tristan (ICube,DKFZ); Bujotzek Markus (DKFZ); Kirchhoff Yannick (DKFZ); Rokuss Maximilian (DKFZ); Isensee Fabian (DKFZ); Kahl Kim-Celine (DKFZ); Kovacs Balint (DKFZ); Maier-Hein Klaus (DKFZ)

arXiv:2605.18329 · cs.CV · May 19, 2026

TL;DR

This paper examines the differences between cross-validation ensembles and deep ensembles in medical image segmentation, highlighting how their construction impacts uncertainty estimation and suggesting appropriate use cases for each.

Contribution

The paper clarifies how CV ensembles and deep ensembles differ for uncertainty estimation, compares their performance, and offers a practical modification for training deep ensembles within nnU-Net.

Findings

1. Deep ensembles improve calibration and failure detection.
2. CV ensembles sometimes correlate better with inter-rater variability.
3. The choice of ensemble should match the specific research goal.

Abstract

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation…
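
The difference between the two constructions described in the abstract can be made concrete with a small sketch. It is not taken from the paper or its repository: the function names (`cv_ensemble_members`, `deep_ensemble_members`, `predictive_entropy`), the seed scheme, and the choice of predictive entropy as the disagreement measure are all assumptions made for illustration. Each ensemble member is represented only by the case indices it would train on and the random seed it would use.

```python
import numpy as np


def cv_ensemble_members(n_cases, k=5, seed=0):
    """Hypothetical helper: K-fold CV ensemble.

    Member i trains on all folds except fold i, so members differ in
    BOTH the training data and the random seed.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_cases), k)
    members = []
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        members.append((np.sort(train), seed + i))
    return members


def deep_ensemble_members(n_cases, m=5, seed=0):
    """Hypothetical helper: deep ensemble.

    Every member trains on the same fixed training set; only the
    random seed (initialisation, data order, augmentation) differs.
    """
    all_cases = np.arange(n_cases)
    return [(all_cases, seed + i) for i in range(m)]


def predictive_entropy(member_probs, eps=1e-12):
    """Entropy of the mean prediction, one common disagreement-based
    uncertainty map (assumed here, not necessarily the paper's choice).

    member_probs: array of shape (M, C, ...) with per-member softmax
    outputs over C classes. Returns an array of shape (...).
    """
    mean = member_probs.mean(axis=0)
    return -(mean * np.log(mean + eps)).sum(axis=0)


if __name__ == "__main__":
    cv = cv_ensemble_members(100, k=5)
    de = deep_ensemble_members(100, m=5)
    print([len(train) for train, _ in cv])  # training-set size per CV member
    print([len(train) for train, _ in de])  # training-set size per DE member
```

With 100 cases and 5 members, each CV member is exposed to 80 cases and each DE member to all 100. Disagreement among CV members therefore reflects both seed effects and differences in data exposure, whereas disagreement among DE members reflects seed effects alone.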

Code & Models

Repositories

kirscher/LostInFolds (GitHub)