# Real-World Benchmarking and Validation of Foundation Model Transformers for Endometrial Cancer Subtyping from Histopathology

**Authors:** Vincent M. Wagner, Casey M. Cosgrove, Stephanie J. Chen, Daniel T. Griffin, Megan I. Samuelson, Michael J. Goodheart, Jesus Gonzalez-Bosquet

PMC · DOI: 10.21203/rs.3.rs-7689962/v1 · Research Square · 2025-11-04

## TL;DR

This study tests whether open-source histopathology foundation models can classify molecular subtypes of endometrial cancer from whole-slide images and retain their performance in an independent, real-world cohort.

## Contribution

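The paper benchmarks four ImageNet-pretrained CNNs and six open-source foundation encoders, using two MIL aggregation strategies (TransMIL and CLAM) within the STAMP pipeline, for molecular subtyping of endometrial cancer from histopathology, and validates the resulting models on an independent cohort of 720 patients.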

## Key findings

- In five-fold cross-validation, foundation models outperformed CNNs (macro-AUC 0.799–0.860 vs 0.715–0.829).
- UNI2 with CLAM achieved the highest external validation macro-AUC of 0.780.
- On external validation, CNN performance degraded substantially, while foundation models retained higher discrimination (macro-AUC 0.667–0.780).
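The macro-AUC values quoted above are unweighted averages of per-subtype, one-vs-rest AUCs. The sketch below shows one way to compute that quantity in plain Python. It is an illustration, not the authors' evaluation code: the function names, the toy probabilities, and the three subtype labels are assumptions made for the example (the abstract names only p53abn explicitly).

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, probs, classes):
    """Unweighted mean of one-vs-rest AUCs, plus the per-class values.

    y_true: list of class names, one per patient.
    probs:  list of dicts mapping class name -> predicted probability.
    """
    per_class = {}
    for c in classes:
        labels = [1 if y == c else 0 for y in y_true]
        scores = [p[c] for p in probs]
        per_class[c] = binary_auc(labels, scores)
    return sum(per_class.values()) / len(classes), per_class


# Toy data: labels and probabilities are made up for illustration only.
CLASSES = ["p53abn", "MMRd", "NSMP"]
y_true = ["p53abn", "MMRd", "NSMP", "p53abn"]
probs = [
    {"p53abn": 0.70, "MMRd": 0.20, "NSMP": 0.10},
    {"p53abn": 0.30, "MMRd": 0.50, "NSMP": 0.20},
    {"p53abn": 0.20, "MMRd": 0.30, "NSMP": 0.50},
    {"p53abn": 0.25, "MMRd": 0.35, "NSMP": 0.40},
]

macro, per_class = macro_auc(y_true, probs, CLASSES)
print(macro, per_class)
```

Because every subtype contributes equally to the average, a rare subtype influences the macro-AUC as much as a common one.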

## Abstract

This study evaluated whether open-source histopathology foundation model pipelines, paired with attention-based multiple instance learning (MIL), can accurately classify molecular subtypes of endometrial cancer (EC) from whole-slide images (WSIs) and maintain performance in a real-world, independent cohort.

We assembled a public discovery cohort of 815 patients (1,195 WSIs) from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium, and an independent external cohort of 720 patients (1,357 WSIs) with molecular subtyping determined by mismatch repair immunohistochemistry plus TP53 and POLE sequencing. Four ImageNet-pretrained convolutional neural networks (CNNs) and six open-source foundation encoders using two MIL aggregation strategies (TransMIL and CLAM) were benchmarked within the STAMP pipeline. Models were trained with five-fold cross-validation and evaluated on an independent cohort. Macro–area under the receiver operating characteristic curve (AUC) was the primary outcome.
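In attention-based MIL, a slide is treated as a bag of tile embeddings produced by the encoder. A small scoring network assigns each tile a weight, and the weighted sum becomes the slide-level representation passed to the classifier. The sketch below illustrates generic attention pooling in plain Python. It is not the CLAM or TransMIL architecture and does not use the STAMP API; the function names, parameter shapes, and toy values are all assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tiles, V, w):
    """Pool tile embeddings into one slide embedding.

    tiles: list of d-dimensional tile embeddings (one per tile).
    V:     hidden x d projection matrix (hypothetical learned weights).
    w:     hidden-dimensional scoring vector (hypothetical learned weights).
    Returns (slide_embedding, attention_weights).
    """
    scores = []
    for h in tiles:
        # Project the tile, squash with tanh, then score with w.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    weights = softmax(scores)  # attention weights sum to 1 across tiles
    dim = len(tiles[0])
    slide = [sum(a * h[j] for a, h in zip(weights, tiles)) for j in range(dim)]
    return slide, weights


# Toy bag of three 2-dimensional tile embeddings; all values are illustrative.
tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, -1.0]]
w = [2.0]

slide, weights = attention_pool(tiles, V, w)
print(slide, weights)
```

In a trained model `V` and `w` are learned jointly with the classifier, and the attention weights can be inspected to see which tiles drove a prediction.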

In cross-validation, foundation models outperformed CNNs (macro-AUC 0.799–0.860 vs 0.715–0.829). The best configuration (Virchow2 with CLAM) achieved macro-AUC 0.860 (95% CI, 0.839–0.880), macro-F1 score 0.607, and balanced accuracy 0.647. External validation showed substantial degradation for CNNs, while foundation models retained higher discrimination (macro-AUC 0.667–0.780). UNI2 with CLAM had the highest external macro-AUC (0.780), and Virchow2 with CLAM had the best balanced accuracy (0.525). Subtype-level AUCs for UNI2 with CLAM were highest for p53abn (0.851).
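Macro-F1 and balanced accuracy are computed from hard class assignments, whereas AUC is threshold-free, which is one reason the two kinds of metric can diverge. The sketch below shows how both follow from per-class true-positive, false-positive, and false-negative counts. It is a minimal illustration under assumed toy labels, not the authors' evaluation code; the function names and data are hypothetical.

```python
def per_class_counts(y_true, y_pred, classes):
    """Count true positives, false positives and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1  # missed a true member of class t
            counts[p]["fp"] += 1  # wrongly assigned to class p
    return counts


def balanced_accuracy(y_true, y_pred, classes):
    """Mean recall over the classes that actually occur in y_true."""
    counts = per_class_counts(y_true, y_pred, classes)
    recalls = [
        c["tp"] / (c["tp"] + c["fn"])
        for c in counts.values()
        if c["tp"] + c["fn"] > 0
    ]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; a class with no counts scores 0."""
    counts = per_class_counts(y_true, y_pred, classes)
    f1s = []
    for c in counts.values():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        f1s.append(2 * c["tp"] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy imbalanced labels, made up for illustration only.
CLASSES = ["NSMP", "MMRd", "p53abn"]
y_true = ["NSMP"] * 6 + ["MMRd"] * 2 + ["p53abn"] * 2
y_pred = ["NSMP"] * 6 + ["NSMP", "MMRd"] + ["p53abn", "NSMP"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred, CLASSES)
mf1 = macro_f1(y_true, y_pred, CLASSES)
print(accuracy, bal_acc, mf1)
```

In this toy set, plain accuracy is inflated by the majority class, while balanced accuracy and macro-F1 penalize the missed minority-class cases.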

Open-source foundation model pipelines with attention-based MIL can deliver accurate and generalizable molecular subtyping of EC directly from WSIs. These models outperform CNNs in real-world validation, supporting their potential as scalable, cost-effective tools to guide precision oncology and triage confirmatory molecular testing.

## Linked entities

- **Genes:** TP53 (tumor protein p53) [NCBI Gene 7157], POLE (DNA polymerase epsilon, catalytic subunit) [NCBI Gene 5426]
- **Diseases:** endometrial cancer (MONDO:0002447)

## Full-text entities

- **Genes:** TP53 (tumor protein p53) [NCBI Gene 7157] {aka BCC7, BMFS5, LFS1, P53, TRP53}
- **Diseases:** EC (MESH:D016889), Cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12637824/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12637824/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12637824/full.md

---
Source: https://tomesphere.com/paper/PMC12637824