# Deep-Learning-Based Classification of Lung Adenocarcinoma and Squamous Cell Carcinoma Using DNA Methylation Profiles: A Multi-Cohort Validation Study

**Authors:** Maram Fahaad Almufareh, Samabia Tehsin, Mamoona Humayun, Sumaira Kausar, Asad Farooq

PMC · DOI: 10.3390/cancers18040607 · Cancers · 2026-02-12

## TL;DR

A deep learning system using DNA methylation data can accurately distinguish between two types of lung cancer, helping doctors choose the right treatment.

## Contribution

A deep neural network using DNA methylation profiles achieves high accuracy in classifying lung cancer subtypes across multiple patient cohorts.

## Key findings

- The model achieved 96.92% accuracy on the TCGA test set with an AUC-ROC of 0.9981.
- In cross-dataset validation, a model trained on the GEO datasets reached 88.92% accuracy and an AUC-ROC of 0.9724 when evaluated on TCGA data.
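
As a reading aid for the two metrics quoted above, here is a minimal sketch of how accuracy and AUC-ROC are typically computed for a binary classifier. The label encoding (1 for LUAD, 0 for LUSC), the 0.5 decision threshold, the function names, and the toy scores are all assumptions for illustration; none of them come from the paper.

```python
import numpy as np

def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    y_pred = (np.asarray(y_score, dtype=float) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def auc_roc(y_true, y_score):
    """AUC-ROC as the probability that a positive sample outscores a
    negative one, counting ties as one half (Mann-Whitney formulation)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Toy data (hypothetical): 1 = LUAD, 0 = LUSC.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

acc = accuracy(y_true, y_score)
auc = auc_roc(y_true, y_score)
print(f"accuracy={acc:.4f}, AUC-ROC={auc:.4f}")
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason the two numbers are usually reported together.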

## Abstract

Lung cancer is the leading cause of cancer death worldwide. Doctors need to determine whether a patient has adenocarcinoma or squamous cell carcinoma because the two cancers call for different treatments. A patient whose cancer type is misidentified may not receive suitable treatment. We developed a computer system that analyzes DNA methylation patterns to distinguish between these two cancer types. The program learned from data on over a thousand patients and correctly identified the cancer type about 97 percent of the time. We also tested the system on two independent patient groups to check that it works beyond the data it was trained on. The explanation tools we applied show which DNA markers matter most to the system's decisions, and we describe our entire method in detail. This research could help medical personnel diagnose lung cancer subtypes quickly and precisely.

Background/Objectives: The precise classification of non-small-cell lung cancer (NSCLC) into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) plays an important role in treatment decisions and prognosis. Proper subtyping ensures that patients receive the most appropriate therapeutic strategies and allows clinicians to make informed evaluations of disease outcomes. This study presents a deep neural-network-based classification approach that uses genome-wide DNA methylation profiles from the Illumina HumanMethylation450 BeadChip platform. Methods: The 5000 most discriminative CpG probes were selected through variance-based feature selection and then classified by a five-layer deep neural network with batch normalization and dropout regularization. Training and validation were performed using data from The Cancer Genome Atlas (TCGA), with external validation conducted on two independent Gene Expression Omnibus (GEO) datasets: GSE39279 and GSE56044. Results: The model achieved 96.92% accuracy with an area under the receiver-operating characteristic curve (AUC-ROC) of 0.9981 on the TCGA test set. Robust generalization was observed in cross-dataset validation experiments, with the GEO-trained model achieving 88.92% accuracy and 0.9724 AUC-ROC when validated on TCGA data. The most influential CpG biomarkers contributing to classification decisions were analyzed using SHAP (SHapley Additive exPlanations). Conclusions: These findings demonstrate the potential of DNA methylation-based deep learning approaches for reliable NSCLC subtype classification with clinical applicability.
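
The feature-selection step in the Methods can be illustrated with a short sketch. It assumes that "variance-based" means ranking CpG probes by their variance across training samples and keeping the top k, and that the input is a samples-by-probes matrix of beta values with no missing entries. The function name `select_top_variance_probes` and the synthetic data are hypothetical; the paper's exact procedure may differ.

```python
import numpy as np

def select_top_variance_probes(beta_train, k=5000):
    """Return sorted column indices of the k highest-variance probes.

    beta_train: array of shape (n_samples, n_probes) with methylation
    beta values in [0, 1]; assumed to contain no missing values.
    """
    variances = np.var(beta_train, axis=0)
    k = min(k, beta_train.shape[1])
    # argpartition finds the k largest variances without a full sort.
    top = np.argpartition(variances, -k)[-k:]
    return np.sort(top)

# Synthetic stand-in for real data: 40 samples x 200 probes.
rng = np.random.default_rng(0)
beta = rng.uniform(0.45, 0.55, size=(40, 200))       # nearly constant probes
beta[:, :10] = rng.uniform(0.0, 1.0, size=(40, 10))  # 10 highly variable probes

selected = select_top_variance_probes(beta, k=10)
reduced = beta[:, selected]
print(selected, reduced.shape)
```

In a cross-dataset setting like the one described, the indices would normally be computed on the training cohort only and then reused to subset the external cohort, so that no information from the validation data influences the selected probes.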

## Linked entities

- **Diseases:** lung adenocarcinoma (MONDO:0005061), lung squamous cell carcinoma (MONDO:0005097), non-small-cell lung cancer (MONDO:0005233)

## Full-text entities

- **Genes:** ALK (ALK receptor tyrosine kinase) [NCBI Gene 238] {aka ALK1, CD246, NBLST3}, EGFR (epidermal growth factor receptor) [NCBI Gene 1956] {aka ERBB, ERBB1, ERRP, HER1, NISBD2, NNCIS}, ITIH2 (inter-alpha-trypsin inhibitor heavy chain 2) [NCBI Gene 3698] {aka H2P, ITI-HC2, SHAP}
- **Diseases:** metastases (MESH:D009362), deaths (MESH:D003643), injury to (MESH:D014947), SCLC (MESH:D055752), Adenocarcinomas (MESH:D000230), CUP (MESH:D009369), Lung cancer (MESH:D008175), LUAD (MESH:D000077192), alveolar adenocarcinoma (MESH:D002282), Squamous Cell Carcinoma (MESH:D002294), HNSC (MESH:D000077195), NSCLC (MESH:D002289)
- **Chemicals:** tyrosine (MESH:D014443)
- **Species:** Homo sapiens (human, species) [taxon 9606], Nicotiana tabacum (American tobacco, species) [taxon 4097]
- **Cell lines:** GEO1 — Homo sapiens (Human), Colon carcinoma, Cancer cell line (CVCL_0271)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12938923/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12938923/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12938923/full.md

---
Source: https://tomesphere.com/paper/PMC12938923