# A hybrid feature extraction framework combining PCA and mutual information for gene expression based lung cancer classification

**Authors:** Syed Naseer Ahmad Shah, Kaartik Issar, Rafat Parveen, Suyan Tian

PMC · DOI: 10.1371/journal.pone.0342160 · PLOS One · 2026-02-05

## TL;DR

This paper introduces a hybrid feature extraction method that combines PCA and mutual information (MI) to rank genes in expression data. A CNN trained on the selected features reached 98% accuracy and 98% precision in lung cancer classification.

## Contribution

The contribution is a hybrid PCA-MI feature extraction framework for gene expression based lung cancer classification, paired with a CNN classifier and with PPI network analysis of the ranked genes for biological interpretation.

## Key findings

- The PCA-MI framework achieved 98% accuracy and 98% precision in lung cancer classification using a CNN.
- PPI analysis identified biologically significant hub genes from the ranked features.
- The hybrid framework outperformed ten other feature extraction methods in benchmarking tests.
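
As an illustration of the hub gene finding above, here is a minimal sketch of picking hubs from a PPI edge list by node degree. It rests on assumptions: this summary does not say which hub criterion or tool the authors used, so degree is a stand-in, and the function name `hub_genes_by_degree` and the `GENE_*` identifiers are hypothetical placeholders rather than data from the paper.

```python
from collections import Counter


def hub_genes_by_degree(edges, top_k=3):
    """Return the top_k (gene, degree) pairs from an undirected edge list.

    ASSUMPTION: degree is used as the hub criterion for illustration only;
    the paper's actual criterion is not stated in this summary.
    """
    # Deduplicate edges and ignore self-loops so each interaction counts once.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique_edges:
        for gene in edge:
            degree[gene] += 1
    # Sort by degree (descending), then by name for a deterministic order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Toy network with placeholder names (not real interaction data).
toy_edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_A"),  # same interaction as A-C, listed in reverse
    ("GENE_D", "GENE_E"),
]
hubs = hub_genes_by_degree(toy_edges, top_k=2)
print(hubs)
```

In practice the edge list would come from a PPI resource restricted to the top-ranked genes. Other centrality measures could replace degree without changing the overall shape of the sketch.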

## Abstract

Lung cancer remains a leading cause of cancer-related mortality worldwide, with early and accurate diagnosis posing a critical challenge for improving patient outcomes. Gene expression data provide crucial insights for lung cancer classification by revealing underlying biological mechanisms. However, the high dimensionality of such data presents challenges, including computational complexity and overfitting risks. This study proposes a hybrid feature extraction framework combining Principal Component Analysis (PCA) and Mutual Information (MI) to address these issues. PCA reduces dimensionality by capturing key variance patterns, while MI selects features highly relevant to the target class, ensuring an informative and concise feature set. Gene expression datasets from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) were integrated, focusing on common genes. The hybrid PCA-MI framework was applied to rank genes, and the selected features were used to train a Convolutional Neural Network (CNN) for lung cancer classification. The genes ranked by the hybrid model were further analysed using protein-protein interaction (PPI) networks to identify hub genes, enhancing biological interpretability. The proposed framework was benchmarked against ten other feature extraction methods, including Lasso, Random Forest, Autoencoder, and PCA alone. The CNN classifier achieved superior performance with the PCA-MI features, attaining 98% accuracy and 98% precision. Training and validation curves demonstrated stable learning behaviour, and confusion matrix analysis confirmed robust predictions. Hub gene identification through PPI analysis validated the biological significance of the ranked genes. 
This study presents a robust framework for lung cancer classification by leveraging the strengths of PCA and MI, integrating deep learning and PPI analysis to address high-dimensional data challenges, and setting a foundation for future research in multi-omics data integration and enhanced diagnostic strategies.
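
The abstract describes ranking genes with a combination of PCA and MI but does not give the exact combination rule, so the following is only a sketch under stated assumptions. It scores each gene by its absolute PCA loadings weighted by explained variance, scores it again by estimated mutual information with the class label, min-max scales both scores, and blends them with a weight `alpha`. The blending rule, the function name `rank_genes_pca_mi`, the parameter defaults, and the synthetic data are all illustrative choices and are not taken from the paper. It uses scikit-learn's `PCA` and `mutual_info_classif`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler


def rank_genes_pca_mi(X, y, n_components=5, alpha=0.5, random_state=0):
    """Rank features by a blend of PCA loading strength and MI with the label.

    ASSUMPTION: the weighted sum of min-max scaled scores is an illustrative
    combination rule, not the rule used in the paper.
    """
    X_std = StandardScaler().fit_transform(X)

    # PCA score: absolute loadings weighted by explained variance ratio.
    pca = PCA(n_components=n_components, random_state=random_state).fit(X_std)
    pca_score = np.abs(pca.components_).T @ pca.explained_variance_ratio_

    # MI score: estimated mutual information between each gene and the label.
    mi_score = mutual_info_classif(X_std, y, random_state=random_state)

    def minmax(v):
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)

    combined = alpha * minmax(pca_score) + (1 - alpha) * minmax(mi_score)
    order = np.argsort(combined)[::-1]  # best-scoring gene first
    return order, combined


# Synthetic stand-in for an expression matrix (samples x genes).
rng = np.random.default_rng(0)
n_samples, n_genes, n_informative = 200, 50, 5
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :n_informative] += 3.0 * y[:, None]  # first five genes track the class

order, combined = rank_genes_pca_mi(X, y)
top_genes = order[:10]
print(top_genes)
```

The column indices in `top_genes` would then select the inputs for a downstream classifier such as the CNN described above. To avoid leaking label information, the ranking should be fitted on training samples only.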

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138)

## Full-text entities

- **Genes:** CFTR (CF transmembrane conductance regulator) [NCBI Gene 1080] {aka ABC35, ABCC7, CF, CFTR/MRP, MRP7, TNR-CFTR}, CD4 (CD4 molecule) [NCBI Gene 920] {aka CD4mut, IMD79, Leu-3, OKT4D, T4}, KL (klotho) [NCBI Gene 9365] {aka HFTC3, KLA}, CYP51A1 (cytochrome P450 family 51 subfamily A member 1) [NCBI Gene 1595] {aka CP51, CYP51, CYPL1, LDM, P450-14DM, P450L1}, TNMD (tenomodulin) [NCBI Gene 64102] {aka BRICD4, CHM1L, TEM}, RAD51 (RAD51 recombinase) [NCBI Gene 5888] {aka BRCC5, FANCR, HRAD51, HsRad51, HsT16930, MRMV2}, NFYA (nuclear transcription factor Y subunit alpha) [NCBI Gene 4800] {aka CBF-A, CBF-B, HAP2, NF-YA}, BRCA1 (BRCA1 DNA repair associated) [NCBI Gene 672] {aka BRCAI, BRCC1, BROVCA1, FANCS, IRIS, PNCA4}
- **Diseases:** Large Cell Carcinoma (MESH:D018287), Adenocarcinoma (MESH:D000230), PCA (MESH:C566443), Cancer (MESH:D009369), Lung Cancer (MESH:D008175), Squamous Cell Carcinoma (MESH:D002294), NSCLC (MESH:D002289), oat cell carcinoma (MESH:D018288), SCLC (MESH:D055752)
- **Chemicals:** ICGC (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12875479/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12875479/full.md

## References

91 references — full list in the complete paper: https://tomesphere.com/paper/PMC12875479/full.md

---
Source: https://tomesphere.com/paper/PMC12875479