# Machine Learning Models Using General and Tissue-Specific Feature Extractors for Accurate Subtyping of Biopsy Samples: Advancing Lung Cancer Diagnosis in Latin America

**Authors:** Viviane Teixeira Loiola de Alencar, Felipe Navarro Balbino Alves, Guilherme de Souza Velozo, Luiz Edmundo Lopes Mizutani, Iusta Caminha, Gabriel Barbosa Silva, Vladmir Cláudio Cordeiro de Lima, Fábio Rocha Fernandes Távora

PMC · DOI: 10.1016/j.jtocrr.2025.100906 · 2025-09-18

## TL;DR

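This paper introduces two DinoV2-based models, LungDino and OncoDino, that classify lung cancer subtypes in HE-stained biopsy whole-slide images and are evaluated on samples from Latin America, a region that is underrepresented in pathology AI and faces resource disparities.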

## Contribution

The study introduces two novel DinoV2-based feature extractors, LungDino and OncoDino, tailored for lung cancer subtype classification in diverse and underrepresented regions.

## Key findings

- LungDino and OncoDino outperformed a ResNet baseline in classifying lung cancer subtypes from HE-stained WSIs.
- OncoDino excelled in underrepresented categories, reaching an AUC of 0.99 for small cell carcinoma.
- Both models generated interpretable heatmaps for tumor localization, even in poorly differentiated cases.

## Abstract

Lung cancer is the leading cause of cancer-related deaths worldwide, with accurate histologic subtype classification critical for diagnosis and treatment planning. Diagnostic variability and resource disparities, particularly in underrepresented regions such as Latin America, pose substantial challenges. This study developed and evaluated novel artificial intelligence models trained on both global and Latin American pathology samples for subtype classification of hematoxylin and eosin (HE)–stained whole-slide images (WSIs).

Two DinoV2-based feature extractors, LungDino and OncoDino, trained on large data sets for task-specific and general pathology applications, were developed. The training data set consisted of 1308 HE-stained WSIs, including 412 adenocarcinomas, 323 squamous cell carcinomas, 41 small cell carcinomas, and 532 benign tissue samples, sourced from The Cancer Genome Atlas and an in-house Latin American data set. A ResNet model trained on ImageNet served as the baseline. Models were evaluated on 79 Latin American WSIs using receiver operating characteristic curves, and heatmaps were generated for tumor localization.
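The evaluation by receiver operating characteristic curves can be illustrated with a minimal sketch of a one-vs-rest AUC for a four-class slide classifier. This is not the authors' code: the class labels, the helper names `roc_auc` and `one_vs_rest_auc`, and the toy scores are all assumptions made for illustration, and the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive slide scores higher than a randomly chosen negative one).

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(
    true_classes: Sequence[str],
    class_scores: Sequence[Sequence[float]],
    classes: Sequence[str],
) -> Dict[str, float]:
    """Compute one AUC per class by treating that class as positive and all others as negative."""
    result: Dict[str, float] = {}
    for idx, name in enumerate(classes):
        binary = [1 if t == name else 0 for t in true_classes]
        scores = [row[idx] for row in class_scores]
        result[name] = roc_auc(binary, scores)
    return result


# Hypothetical class names mirroring the four categories in the training set.
CLASSES: List[str] = ["adenocarcinoma", "squamous", "small_cell", "benign"]

# Toy slide-level predictions (invented numbers, not results from the paper).
true_classes = ["adenocarcinoma", "squamous", "small_cell", "benign", "adenocarcinoma", "benign"]
class_scores = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.25, 0.65, 0.05, 0.05],  # an adenocarcinoma slide mistaken for squamous
    [0.30, 0.10, 0.10, 0.50],
]

aucs = one_vs_rest_auc(true_classes, class_scores, CLASSES)
for name in CLASSES:
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

In this toy example the single misclassified slide lowers the adenocarcinoma and squamous AUCs while leaving the other two classes at 1.0, which is why per-class AUCs are reported separately rather than as one pooled number.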

The DinoV2-based models outperformed the ResNet baseline. LungDino achieved the highest overall performance, with areas under the curve (AUCs) of 0.97 for adenocarcinoma and 0.96 for squamous cell carcinoma. OncoDino excelled in underrepresented categories, achieving an AUC of 0.99 for small cell carcinoma, demonstrating its generalizability. Both models generated interpretable heatmaps, with LungDino demonstrating precise tumor localization. In the subset of samples classified as poorly differentiated or undifferentiated in HE pathology reports, the DinoV2 models also maintained high classification performance.
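The abstract does not describe how the localization heatmaps were produced. As a rough sketch of one common approach, and assuming (hypothetically) that each WSI is split into a grid of tiles and that the model yields a tumor score per tile, the scores can be placed back on the grid to form a coarse heatmap. The function names, the 0.5 threshold, and the toy scores below are illustrative assumptions, not details from the paper.

```python
from typing import Dict, List, Optional, Tuple

Tile = Tuple[int, int]  # (row, col) position of a tile in the slide grid


def build_heatmap(
    tile_scores: Dict[Tile, float], n_rows: int, n_cols: int
) -> List[List[Optional[float]]]:
    """Place per-tile scores on a grid; tiles without a score (background) stay None."""
    grid: List[List[Optional[float]]] = [[None] * n_cols for _ in range(n_rows)]
    for (r, c), score in tile_scores.items():
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            raise ValueError(f"tile {(r, c)} lies outside the {n_rows}x{n_cols} grid")
        grid[r][c] = score
    return grid


def render_ascii(grid: List[List[Optional[float]]], threshold: float = 0.5) -> List[str]:
    """Render the grid as text: '.' background, 'o' tissue below threshold, '#' at or above it."""
    lines = []
    for row in grid:
        chars = []
        for score in row:
            if score is None:
                chars.append(".")
            elif score >= threshold:
                chars.append("#")
            else:
                chars.append("o")
        lines.append("".join(chars))
    return lines


# Toy tile scores for a 3x3 grid (invented values for illustration only).
tile_scores = {(0, 0): 0.1, (0, 1): 0.2, (1, 1): 0.9, (1, 2): 0.8, (2, 2): 0.6}

heatmap = build_heatmap(tile_scores, n_rows=3, n_cols=3)
rendered = render_ascii(heatmap)
print("\n".join(rendered))
```

A real pipeline would typically overlay a color map on the slide thumbnail instead of printing characters, but the mapping from tile coordinates to scores follows the same idea.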

These findings underscore the effectiveness of task-specific and general feature extractors in delivering accurate, explainable results and address a gap in artificial intelligence–driven histopathology advancements, paving the way for future clinical applications.

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138), adenocarcinoma (MONDO:0004970), squamous cell carcinoma (MONDO:0005096), small cell carcinoma (MONDO:0000402)

## Full-text entities

- **Diseases:** Cancer (MESH:D009369), small cell carcinoma (MESH:D018288), Lung Cancer (MESH:D008175), squamous cell carcinoma (MESH:D002294), adenocarcinoma (MESH:D000230)
- **Chemicals:** HE (-)

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12629879/full.md

---
Source: https://tomesphere.com/paper/PMC12629879