# Explainable AI for Predicting Mortality Risk in Metastatic Cancer: Retrospective Cohort Study Using the Memorial Sloan Kettering-Metastatic Dataset

**Authors:** Polycarp Nalela, Deepthi Rao, Praveen Rao

PMC · DOI: 10.2196/74196 · JMIR Cancer · 2026-01-13

## TL;DR

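This study trains explainable machine learning models on the MSK-MET cohort to predict overall survival in patients with metastatic cancer. XGBoost performed best (AUC=0.82), and SHAP analysis identified metastatic site count, tumor mutational burden, and liver and bone metastases among the strongest drivers of mortality risk.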

## Contribution

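The study develops explainable ML models (XGBoost interpreted with SHAP, plus an XGBoost-Cox time-to-event model) for survival prediction in metastatic cancer using the MSK-MET dataset, at both the pan-cancer and cancer-specific levels.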

## Key findings

- XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82) in predicting survival.
- Metastatic site count, tumor mutational burden, fraction of genome altered, and distant liver and bone metastases were among the strongest prognostic factors in both SHAP and Cox analyses.
- Performance varied by cancer type: the prostate cancer model was the most discriminative (AUC=0.88), while pancreatic cancer was the hardest to predict (AUC=0.68).
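To make the AUC values above easier to read: AUC can be interpreted as the probability that a randomly chosen deceased patient receives a higher predicted risk than a randomly chosen living patient. The sketch below computes it by direct pair counting. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie in score counts as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = deceased, 0 = living at last follow-up.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs ranked correctly
```

Pair counting is quadratic in the number of patients, so it suits small illustrations only. On a cohort of this size a rank-based implementation from an established library would be the practical choice.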

## Abstract

Metastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology.

This study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques.

We performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold.
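The last step described above, computing positive and negative predictive value at the Youden-optimal threshold, can be sketched as follows. This is a minimal illustration, assuming that a score at or above the threshold is classified as deceased and that candidate thresholds are the observed scores. The helper names and toy data are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(labels: Sequence[int], scores: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), calling score >= threshold a positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def youden_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Observed score maximizing J = sensitivity + specificity - 1."""
    if 1 not in labels or 0 not in labels:
        raise ValueError("both classes must be present")
    best_j, best_t = float("-inf"), scores[0]
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(labels, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # on ties, the lowest threshold is kept
            best_j, best_t = j, t
    return best_t


def ppv_npv(labels: Sequence[int], scores: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """Positive and negative predictive value at a given threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical toy data: 1 = deceased, 0 = living at last follow-up.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.5, 0.7, 0.4, 0.2, 0.1]
toy_threshold = youden_threshold(toy_labels, toy_scores)
toy_ppv, toy_npv = ppv_npv(toy_labels, toy_scores, toy_threshold)
```

Unlike sensitivity and specificity, PPV and NPV depend on the prevalence of the outcome, so values computed on a class-balanced dataset may not carry over to a population with a different mortality rate.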

XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification.
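The concordance index used above to compare the XGBoost-Cox and traditional Cox models measures how often the model assigns a higher risk to the patient who dies earlier. Below is a minimal sketch of Harrell's C-index, assuming that pairs with tied event times are skipped and that ties in predicted risk count as half concordant. The function name and toy values are illustrative and not taken from the study.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's C-index for right-censored survival data.

    times:  follow-up time per patient
    events: 1 if death was observed, 0 if censored
    risks:  predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            # Comparable: patient i died strictly before patient j's time.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third patient is censored at month 15.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.6, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)  # 4 of 5 pairs
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.70 and 0.66 should be read.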

Explainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.

## Linked entities

- **Diseases:** metastatic cancer (MONDO:0024880), prostate cancer (MONDO:0005159), pancreatic cancer (MONDO:0005192)

## Full-text entities

- **Genes:** SLTM (SAFB like transcription modulator) [NCBI Gene 79811] {aka Met}, SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, MCC (MCC regulator of Wnt signaling pathway) [NCBI Gene 4163] {aka MCC1}, SIK1 (salt inducible kinase 1) [NCBI Gene 150094] {aka DEE30, MSK, SIK, SIK-1, SIK1B, SNF1LK}
- **Diseases:** Metastatic Cancer (MESH:D009369), lung cancer (MESH:D008175), head and neck cancer (MESH:D006258), CPH (MESH:D030401), thyroid cancer (MESH:D013964), colorectal and soft tissue sarcoma (MESH:D012509), Pancreatic cancer (MESH:D010190), lymph node metastasis (MESH:D008207), Non-small cell lung cancer (MESH:D002289), liver (MESH:D017093), colorectal and prostate cancer (MESH:D015179), skin lesions (MESH:D012871), fatalities (MESH:C565541), Metastatic (MESH:D000092182), lung (MESH:D008171), breast cancer (MESH:D001943), bone (MESH:D001847), Metastasis (MESH:D009362), anal cancer (MESH:D001005), Memorial Sloan Kettering-Metastatic (MESH:D008569), Prostate cancer (MESH:D011471), disease (MESH:D004194), death (MESH:D003643)
- **Chemicals:** CPH (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12848487/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12848487/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/PMC12848487/full.md

---
Source: https://tomesphere.com/paper/PMC12848487