Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets

Soumen Ghosh; Christine Jestin Hannan; Rajat Vashistha; Parveen Kundu; Sandra Brosda; Lauren G. Aoude; James Lonie; Andrew Nathanson; Jessica Ng; Andrew P. Barbour; Viktor Vegh

arXiv:2508.18612 · eess.IV · August 27, 2025

TL;DR

This study evaluates the generalization of 3D nnU-Net for PET-CT tumor segmentation across multiple cancer types and datasets, emphasizing dataset diversity over model complexity for clinical robustness.

Contribution

First cross-cancer evaluation of nnU-Net on PET-CT with novel multi-cohort datasets, highlighting dataset diversity's importance for robust generalization.

Findings

01

Combined training improves robustness across cohorts.

02

Dataset diversity outweighs architectural complexity for generalization.

03

Models trained on diverse data perform better on unseen domains.

Abstract

Robust generalization is essential for deploying deep-learning-based tumor segmentation in clinical PET-CT workflows, where anatomical sites, scanners, and patient populations vary widely. This study presents the first cross-cancer evaluation of nnU-Net on PET-CT, introducing two novel, expert-annotated whole-body datasets: 279 patients with oesophageal cancer (Australian cohort) and 54 with lung cancer (Indian cohort). These cohorts complement the public AutoPET dataset and enable systematic stress-testing of cross-domain performance. We trained and tested 3D nnU-Net models under three paradigms: target-only (oesophageal), public-only (AutoPET), and combined training. On the test sets, the oesophageal-only model achieved the best in-domain accuracy (mean DSC, 57.8) but failed on the external Indian lung cohort (mean DSC less than 3.4), indicating severe overfitting. The public-only model…
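The DSC values quoted in the abstract are Dice similarity coefficients between predicted and reference tumor masks. As a minimal sketch of how such a score could be computed, the snippet below assumes binary 3D masks, per-case averaging, and a 0-100 scale (which the reported 57.8 suggests); the function names and the handling of empty masks are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def dice_score(pred, target, empty_value=1.0):
    """Dice similarity coefficient between two binary masks.

    empty_value is returned when both masks are empty; this convention
    is an assumption, since the abstract does not specify one.
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return empty_value
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / denom


def mean_dice_percent(pairs):
    """Mean per-case Dice over (pred, target) pairs, scaled to 0-100 (assumed scale)."""
    return 100.0 * float(np.mean([dice_score(p, t) for p, t in pairs]))


# Toy volumes: two 8-voxel blocks that overlap in 4 voxels.
pred = np.zeros((4, 4, 4), dtype=np.uint8)
target = np.zeros((4, 4, 4), dtype=np.uint8)
pred[0:2, 0:2, 0:2] = 1
target[1:3, 0:2, 0:2] = 1

score = dice_score(pred, target)
```

In this toy case the two masks share half of their voxels, so the per-case Dice is 0.5. Averaging per case rather than pooling voxels across patients keeps small lesions from being swamped by large ones, though whether the paper aggregates this way is not stated in the abstract.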
