# Stratifying risk of disease in haematuria patients using machine learning techniques to improve diagnostics

**Authors:** Anna Drożdż, Brian Duggan, Mark W. Ruddock, Cherith N. Reid, Mary Jo Kurth, Joanne Watt, Allister Irvine, John Lamont, Peter Fitzgerald, Declan O’Rourke, David Curry, Mark Evans, Ruth Boyd, Jose Sousa

PMC · DOI: 10.3389/fonc.2024.1401071 · 2024-05-08

## TL;DR

This study uses machine learning to classify haematuria patients into healthy or sick groups and identifies key biomarkers for better diagnosis.

## Contribution

The study introduces the CACTUS algorithm as a robust method for classifying haematuria patients in unbalanced datasets and identifies gender-specific biomarkers.

## Key findings

- The CACTUS algorithm achieved a balanced accuracy of 0.747 on the whole dataset (both genders) when classifying haematuria patients.
- Microalbumin, male gender, and tPSA were identified as the most informative biomarkers for the whole dataset.
- Gender-specific biomarkers were also identified as significant, including tPSA and cystatin C for males and IL-8 for females.

## Abstract

Detailed and invasive clinical investigations are required to identify the causes of haematuria. A highly unbalanced patient population (predominantly male) and a wide range of potential causes make it a major challenge to classify patients correctly and to identify patient-specific biomarkers. Studies have shown that it is possible to improve diagnosis using multi-marker analysis, even in unbalanced datasets, by applying advanced analytical methods. Here, we applied several machine learning algorithms to classify patients from the haematuria patient cohort (HaBio) by analysing multiple biomarkers and to identify the most relevant ones.

We applied several classification and feature selection methods (k-means clustering, decision trees, random forest with a LIME explainer, and the CACTUS algorithm) to stratify patients into two groups: healthy (with no clear cause of haematuria) or sick (with an identified cause of haematuria, e.g., bladder cancer or infection). The classification performance of the models was compared. Biomarkers identified as important by the algorithms were also analysed in relation to their involvement in the pathological processes.

Results showed that the high imbalance in the datasets significantly affected classification by random forest and decision trees, leading to overestimation of the sick class and low model performance. The CACTUS algorithm was more robust to the imbalance in the dataset. CACTUS obtained a balanced accuracy of 0.747 for the whole dataset (both genders), 0.718 for females and 0.803 for males. The analysis showed that, in the classification process for the whole dataset, microalbumin, male gender, and tPSA emerged as the most informative biomarkers. For males, the most significant biomarkers were age, microalbumin, tPSA, cystatin C, BTA, HAD and S100A4, while for females they were microalbumin, IL-8, pERK, and CXCL16.
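To illustrate why balanced accuracy is the metric reported here, the following is a minimal sketch in plain Python. It assumes the usual definition of balanced accuracy as the mean of per-class recall. The toy cohort (20 healthy, 80 sick) and both sets of predictions are hypothetical and are not taken from the HaBio data. The function names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of all patients whose label was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, labels=("healthy", "sick")):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in labels:
        # Indices of patients that truly belong to this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        if not idx:
            raise ValueError(f"no true samples for class {label!r}")
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical unbalanced cohort: 20 healthy, 80 sick (not study data).
y_true = ["healthy"] * 20 + ["sick"] * 80

# A model that overestimates the sick class by always predicting "sick".
always_sick = ["sick"] * 100

# A model that recovers 15 of 20 healthy and 60 of 80 sick patients.
mixed = (["healthy"] * 15 + ["sick"] * 5) + (["sick"] * 60 + ["healthy"] * 20)

acc_always = accuracy(y_true, always_sick)
bal_always = balanced_accuracy(y_true, always_sick)
acc_mixed = accuracy(y_true, mixed)
bal_mixed = balanced_accuracy(y_true, mixed)

print(f"always sick: accuracy={acc_always:.2f}, balanced={bal_always:.2f}")
print(f"mixed model: accuracy={acc_mixed:.2f}, balanced={bal_mixed:.2f}")
```

In this toy case the model that labels everyone as sick scores higher on plain accuracy but only 0.5 on balanced accuracy, which is the kind of majority-class overestimation the abstract describes for random forest and decision trees.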

The CACTUS algorithm demonstrated improved performance compared with other methods such as decision trees and random forest. Additionally, we identified the most relevant biomarkers for each specific patient group, which could be considered in the future as novel biomarkers for diagnosis. Our results have the potential to inform future research and to support new personalised diagnostic approaches tailored directly to the needs of individuals.

## Linked entities

- **Proteins:** S100A4 (S100 calcium binding protein A4), EIF2AK3 (eukaryotic translation initiation factor 2 alpha kinase 3), EPHB2 (EPH receptor B2)
- **Chemicals:** tPSA (PubChem CID 77890), IL-8 (PubChem CID 169410440)
- **Diseases:** bladder cancer (MONDO:0004986), infection (MONDO:0005550)

## Full-text entities

- **Genes:** CST3 (cystatin C) [NCBI Gene 1471] {aka ADLDWA, ARMD11, HEL-S-2}, S100A4 (S100 calcium binding protein A4) [NCBI Gene 6275] {aka 18A2, 42A, CAPL, FSP1, MTS1, P9KA}, HAAO (3-hydroxyanthranilate 3,4-dioxygenase) [NCBI Gene 23498] {aka 3-HAO, HAO, VCRL1, h3HAO}, CXCL8 (C-X-C motif chemokine ligand 8) [NCBI Gene 3576] {aka GCP-1, GCP1, IL8, LECT, LUCT, LYNAP}, EIF2AK3 (eukaryotic translation initiation factor 2 alpha kinase 3) [NCBI Gene 9451] {aka PEK, PERK, WRS}, CXCL16 (C-X-C motif chemokine ligand 16) [NCBI Gene 58191] {aka CXCLG16, SR-PSOX, SRPSOX}
- **Diseases:** infection (MESH:D007239), bladder cancer (MESH:D001749)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11109371/full.md

---
Source: https://tomesphere.com/paper/PMC11109371