# Leveraging Machine Learning for Severity Level-Wise Biomarker Identification in Prostate Cancer Microarray Gene Expression Data

**Authors:** Ahmed Al Marouf, Tarek A. Bismar, Sunita Ghosh, Jon G. Rokne, Reda Alhajj

PMC · DOI: 10.3390/biomedicines13102350 · Biomedicines · 2025-09-25

## TL;DR

This paper applies machine learning classifiers to prostate cancer tissue microarray gene expression data to identify biomarkers for each of five severity levels.

## Contribution

A novel ML framework is proposed for severity level-wise biomarker identification in prostate cancer using stratified validation and class imbalance handling.
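As an illustration of the stratified validation mentioned above, the following is a minimal sketch of stratified fold assignment using only the Python standard library. It is not the authors' implementation, which this summary does not show. The function name `stratified_folds`, the toy class counts, and the choice of five folds are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so class proportions stay similar per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # rotate the starting fold so leftovers do not pile up in fold 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[(offset + position) % k].append(idx)
        offset += len(indices)
    return folds


# Hypothetical, imbalanced toy labels (not the study's data).
labels = (
    ["low"] * 50
    + ["intermediate-low"] * 25
    + ["intermediate"] * 15
    + ["intermediate-high"] * 5
    + ["high"] * 5
)
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(dict(Counter(labels[i] for i in fold)))
```

Each fold then serves once as the held-out set while the remaining folds are used for training, so even the rarest class appears in every evaluation split.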

## Key findings

- The framework reached 96.85% accuracy with XGBoost on tissue microarray gene expression data from 1119 samples.
- The method effectively distinguishes critical biomarkers across five severity levels of prostate cancer.

## Abstract

Background: Prostate cancer is the most commonly occurring cancer amongst men, so its detection and treatment are of great importance. The severity level of this cancer, which is established as a score in the Gleason Grading Group (GGG), guides treatment.

Methods: In this paper, traditional machine learning (ML) classification methods such as Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and XGBoost (XGB), which have recently been shown to identify biomarkers accurately in computational biology, are leveraged to find potential biomarkers for the different GGG scores. An ML framework that maps the GGG into five severity levels (low, intermediate-low, intermediate, intermediate-high, and high) has been developed using these methods. The microarray data for this framework were derived from immunohistochemical tests. The study performs severity level-wise biomarker identification, incorporating missing value imputation, class imbalance handling using the SMOTE-Tomek link method, and stratified k-fold validation to ensure robust biomarker selection.

Results: The framework is evaluated on prostate cancer tissue microarray gene expression data from 1119 samples. A combination of high-aggressive and low-aggressive signatures is used in four experimental setups. The results demonstrate the effectiveness of the approach in distinguishing between critical biomarkers with highly accurate models, obtaining 96.85% accuracy using the XGBoost method.

Conclusions: Leveraging ML provides a basis for involving domain experts, and the satisfactory results support this. In future work, a physician-in-the-loop approach can be tested to further improve diagnostic impact.
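To make the SMOTE-Tomek step named in the Methods more concrete, here is a minimal standard-library sketch of its two ingredients: a SMOTE-style synthetic sample interpolated between two minority-class points, and detection of Tomek links (pairs of mutual nearest neighbours that carry different labels). The helper names, the 2-D toy points, and the `"maj"`/`"min"` labels are hypothetical. This is a sketch of the general technique, not the paper's pipeline.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Sorted pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return sorted(
        {(min(i, j), max(i, j))
         for i, j in enumerate(nn)
         if nn[j] == i and labels[i] != labels[j]}
    )


def smote_point(a, b, t):
    """Synthetic sample on the segment between minority points a and b, 0 <= t <= 1."""
    return tuple(x + t * (y - x) for x, y in zip(a, b))


# Hypothetical toy data: three majority points, three minority points.
points = [(0.0, 0.0), (0.0, 2.0), (1.0, 0.0), (1.2, 0.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["maj", "maj", "maj", "min", "min", "min"]

links = tomek_links(points, labels)               # borderline pair across classes
synthetic = smote_point(points[4], points[5], 0.5)  # midpoint of two minority points
print(links, synthetic)
```

In the combined method, oversampling is applied first and Tomek-link cleaning second. Whether one or both members of a link are removed is a configuration choice. In practice a maintained library such as imbalanced-learn, which provides a `SMOTETomek` class, would normally be used instead of hand-written helpers.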

## Linked entities

- **Diseases:** prostate cancer (MONDO:0005159)

## Full-text entities

- **Diseases:** cancer (MESH:D009369), Prostate Cancer (MESH:D011471)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12562123/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12562123/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12562123/full.md

---
Source: https://tomesphere.com/paper/PMC12562123