# Learning the Language of Histopathology Images reveals Prognostic Subgroups in Invasive Lung Adenocarcinoma Patients

**Authors:** Abdul Rehman Akbar, Usama Sajjad, Ziyu Su, Wencheng Li, Fei Xing, Jimmy Ruiz, Wei Chen, Muhammad Khalid Khan Niazi

PMC · DOI: 10.21203/rs.3.rs-8089525/v1 · Research Square · 2026-01-16

## TL;DR

PathRosetta, an AI model that treats H&E histopathology slides as a language, predicts five-year recurrence in surgically resected invasive lung adenocarcinoma and identifies patient subgroups with divergent outcomes.

## Contribution

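PathRosetta introduces a language-based AI framework for histopathology in which cells serve as words, spatial neighborhoods as syntactic structures, and tissue architecture as sentences. This representation improves recurrence prediction and reveals interpretable prognostic subgroups.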

## Key findings

- PathRosetta predicted five-year recurrence with an AUC of 0.78±0.04 on the internal cohort, outperforming IASLC grading (AUC 0.71), AJCC staging (AUC 0.64), and other state-of-the-art AI models (AUC 0.62–0.67).
- The model generalized well to external datasets (TCGA and CPTAC) with AUCs of 0.75 and 0.76, respectively.
- It identified distinct prognostic subgroups within individual cell types based on morpho-spatial phenotypes.

## Abstract

Recurrence remains a major clinical challenge in surgically resected invasive lung adenocarcinoma, where existing grading and staging systems fail to capture the cellular complexity that underlies tumor aggressiveness. We present PathRosetta, a novel AI model that conceptualizes histopathology as a language, where cells serve as words, spatial neighborhoods form syntactic structures, and tissue architecture composes sentences. By learning this language of histopathology, PathRosetta predicts five-year recurrence directly from hematoxylin-and-eosin (H&E) slides, treating them as documents representing the state of the disease. In a multi-cohort dataset of 289 patients (600 slides), PathRosetta achieved an area under the curve (AUC) of 0.78±0.04 on the internal cohort, significantly outperforming IASLC grading (AUC:0.71), AJCC staging (AUC:0.64), and other state-of-the-art AI models (AUC:0.62–0.67). It yielded a hazard ratio of 9.54 and a concordance index of 0.70, generalized robustly to external TCGA (AUC:0.75) and CPTAC (AUC:0.76) cohorts, and performed consistently across demographic and clinical subgroups. Beyond whole-slide prediction, PathRosetta uncovered prognostic subgroups within individual cell types, revealing that even within benign epithelial, stromal, or other cells, distinct morpho-spatial phenotypes correspond to divergent outcomes. Moreover, because the model explicitly understands what it is looking at, including cell types, cellular neighborhoods, and higher-order tissue morphology, it is inherently interpretable and can articulate the rationale behind its predictions. These findings establish that representing histopathology as a language enables interpretable and generalizable prognostication from routine histology.
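The two headline metrics in the abstract, AUC and the concordance index, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `auc` and `concordance_index`, the toy labels, follow-up times, and risk scores are hypothetical and do not come from the paper, and the concordance index uses a simplified Harrell's C that skips pairs with tied follow-up times.

```python
from itertools import combinations


def auc(labels, scores):
    """Probability that a random positive case scores above a random negative case.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def concordance_index(times, events, risks):
    """Simplified Harrell's C for right-censored data.

    A pair is comparable when the patient with the shorter follow-up time
    had an observed event. Pairs with tied times are skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        # a is the patient with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the ordering is unknown
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort (NOT data from the paper).
toy_recurred = [1, 1, 0, 0, 1, 0]          # 1 = recurrence within five years
toy_months = [12, 30, 60, 24, 48, 60]      # follow-up time in months
toy_risk = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]  # model risk scores

toy_auc = auc(toy_recurred, toy_risk)
toy_c = concordance_index(toy_months, toy_recurred, toy_risk)
print(f"AUC = {toy_auc:.3f}, C-index = {toy_c:.3f}")
```

AUC scores a fixed five-year binary outcome, while the concordance index also uses the ordering of follow-up times and handles censored patients, which is why the abstract reports both.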

## Full-text entities

- **Diseases:** tumor (MESH:D009369), Lung Adenocarcinoma (MESH:D000077192)
- **Chemicals:** hematoxylin (MESH:D006416)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12869682/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12869682/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC12869682/full.md

---
Source: https://tomesphere.com/paper/PMC12869682