Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models
Sajad Ghawami

TL;DR
This paper systematically audits the calibration of multimodal cancer survival models, revealing widespread miscalibration despite strong discriminative performance, and evaluates methods to improve calibration.
Contribution
To the authors' knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival models; it documents calibration failures and evaluates post-hoc calibration methods.
Findings
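In Experiment A (3 models on TCGA-BRCA), all three models fail 1-calibration on a majority of folds: 12 of 15 fold-level tests reject after Benjamini-Hochberg correction.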
Gating-based fusion improves calibration over other methods.
Platt scaling reduces miscalibration without harming discrimination.
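The Platt scaling finding can be illustrated with a minimal sketch. It assumes each patient has a raw predicted event probability at one fixed time horizon and a known binary outcome at that horizon; censoring is ignored for brevity, and the paper's own handling of it is not described in this summary. The names fit_platt and apply_platt are hypothetical, and the Newton solver is just one convenient way to fit the two parameters.

```python
import numpy as np

def _logit(p, eps=1e-6):
    # Clip to avoid infinite logits at exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def fit_platt(p_raw, y, n_iter=50, ridge=1e-8):
    """Fit a, b in sigmoid(a * logit(p_raw) + b) by Newton's method.

    Assumption: y is a fully observed 0/1 outcome at the chosen horizon.
    """
    z = _logit(p_raw)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([z, np.ones_like(z)])
    w = np.array([1.0, 0.0])  # start at the identity map (a=1, b=0)
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (q - y)
        # The small ridge term only stabilises the linear solve.
        hess = X.T @ (X * (q * (1 - q))[:, None]) + ridge * np.eye(2)
        step = np.linalg.solve(hess, grad)
        w = w - step
        if np.max(np.abs(step)) < 1e-10:
            break
    return w[0], w[1]

def apply_platt(p_raw, a, b):
    """Map raw probabilities to recalibrated probabilities."""
    return 1.0 / (1.0 + np.exp(-(a * _logit(p_raw) + b)))
```

When the fitted slope a is positive, the map is strictly increasing, so the ordering of patients by predicted risk is unchanged. That is why a recalibration of this form would be expected to leave the concordance index intact.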
Abstract
Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction).…
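Two steps named in the abstract can be sketched in a few lines: Breslow reconstruction of survival curves from scalar risk scores, and Benjamini-Hochberg correction of fold-level p-values. This is a sketch under stated assumptions, not the paper's code. It treats the risk score as a log-hazard, uses the textbook Breslow estimator, and every function name is hypothetical. The 1-calibration test that produces the p-values is not shown.

```python
import numpy as np

def breslow_cumhaz(time, event, risk):
    """Breslow baseline cumulative hazard H0 at each distinct event time.

    Assumption: risk is a log-hazard score, so exp(risk) is the relative hazard.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    hr = np.exp(np.asarray(risk, dtype=float))
    event_times = np.unique(time[event])
    # d_j: number of events at t_j.
    d = np.array([np.sum(event & (time == t)) for t in event_times])
    # Denominator: summed relative hazard of subjects still at risk at t_j.
    at_risk = np.array([hr[time >= t].sum() for t in event_times])
    return event_times, np.cumsum(d / at_risk)

def survival_at(t, event_times, cumhaz, risk):
    """S(t | risk) = exp(-H0(t) * exp(risk)), with H0 a step function."""
    idx = np.searchsorted(event_times, t, side="right") - 1
    h0 = cumhaz[idx] if idx >= 0 else 0.0
    return np.exp(-h0 * np.exp(np.asarray(risk, dtype=float)))

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with alpha * i / m.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        reject[order[:k + 1]] = True  # reject everything up to the largest pass
    return reject
```

In a fold-level audit of this kind, one would typically fit breslow_cumhaz on a training fold, evaluate survival_at on held-out patients at the horizon of interest, test calibration per fold, and then pass the collected p-values to benjamini_hochberg.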
