# Early Risk Factor Prediction in Chronic Kidney Disease Diagnosis Using Feature Selection and Machine Learning Algorithms

**Authors:** Chowdhury Nazia Enam Prima, Martti Juhola

PMC · DOI: 10.1055/a-2797-4380 · Methods of Information in Medicine · 2026-02-13

## TL;DR

This study uses machine learning and feature selection to predict early risk factors for chronic kidney disease, improving diagnostic accuracy and patient care.

## Contribution

The novelty lies in combining feature selection techniques with multiple machine learning models to enhance early CKD risk prediction.

## Key findings

- Gradient boosting achieved the highest performance metrics across accuracy, precision, recall, and AUC.
- Hemoglobin was identified as the most significant risk factor for CKD using feature selection methods.
- Classifiers achieved 86-98% accuracy and AUC values over 0.96, demonstrating strong diagnostic potential.

## Abstract

Chronic kidney disease (CKD) is a long-term kidney illness in which kidney function deteriorates progressively over time. Unlike damage to some other organs, this loss of kidney function cannot be recovered or reversed. Moreover, the disease is very often asymptomatic in its early stages, which makes early identification with conventional clinical approaches difficult. Early and accurate detection of risk factors is therefore a challenging step in CKD diagnosis.

This work aimed to identify risk factors earlier and more effectively using established feature selection techniques, in order to improve patient care. It also aimed to improve the predictive diagnosis of CKD using several supervised and ensemble machine learning classifiers.

A CKD-focused dataset consisting of 1,032 patient records and 14 features was used. The study focused on identifying the risk factors of CKD with two feature selection processes: feature importance (for tree-based models) combined with a sequential feature selector, and the ReliefF algorithm. Based on the ranking from each technique, the top 10 features were identified. Using those features, the classifiers (random forest, support vector machine, Naïve Bayes, decision tree, logistic regression, gradient boosting, K-nearest neighbors, and an ensemble voting classifier) were trained using stratified 5-fold cross-validation and grid search. Their performance in classifying individuals as having or not having CKD was then assessed using accuracy, F1 score, precision, recall, training loss, test loss, bias, and AUC.
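The ReliefF ranking step named above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary labels, numeric features without missing values, Manhattan distance on range-scaled features, and one pass over every record. The function name `relieff_weights`, the parameter `k`, and the toy values are hypothetical and chosen only for illustration.

```python
def relieff_weights(X, y, k=3):
    """Simplified ReliefF scores for binary labels and numeric features.

    Assumptions (not from the paper): no missing values, Manhattan
    distance on range-scaled features, every record visited once.
    """
    n, d = len(X), len(X[0])
    # Per-feature range, so each difference is scaled into [0, 1].
    span = []
    for j in range(d):
        column = [row[j] for row in X]
        span.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        # All other records, nearest first.
        others = sorted((t for t in range(n) if t != i),
                        key=lambda t: dist(X[i], X[t]))
        hits = [t for t in others if y[t] == y[i]][:k]
        misses = [t for t in others if y[t] != y[i]][:k]
        if not hits or not misses:
            continue  # a record needs both neighbor types to be scored
        for j in range(d):
            hit_term = sum(diff(X[i], X[t], j) for t in hits) / len(hits)
            miss_term = sum(diff(X[i], X[t], j) for t in misses) / len(misses)
            # Reward features that differ across classes and penalize
            # features that differ within a class.
            weights[j] += (miss_term - hit_term) / n
    return weights


# Made-up toy data, not taken from the study dataset: column 0 mimics a
# hemoglobin-like value that separates the classes, column 1 is noise.
X_toy = [[15.1, 3.0], [14.2, 7.0], [13.8, 1.0], [14.9, 9.0],
         [9.5, 2.0], [10.1, 8.0], [8.9, 4.0], [9.9, 6.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

weights = relieff_weights(X_toy, y_toy, k=2)
# Feature indices ordered from most to least relevant.
ranking = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
print(ranking, weights)
```

Keeping the top 10 entries of such a ranking would correspond to the selection step described in the abstract; on real data a maintained library implementation is likely preferable to this quadratic-time sketch.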

The feature selection algorithms selected the top 10 features in a data-driven way. Based on the rankings from both feature selection procedures, hemoglobin was determined to be the most significant risk factor among these features. With both feature selection techniques, the classifiers achieved accuracy of 86 to 98%, AUC values from above 0.96 up to 1.00, and bias values of 0.003 to 0.103. All the classifiers showed a very good trade-off between false positives and false negatives: using feature importance with the sequential feature selector (SFS), precision, recall, and F1 score ranged from 92 to 98%, 90 to 99%, and 93 to 98%, respectively. With both feature selection techniques, gradient boosting outperformed all other algorithms in terms of accuracy, precision, AUC, recall, F1 score, specificity, and bias.
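For readers less familiar with the reported measures, the sketch below shows how accuracy, precision, recall, specificity, and F1 score follow from the four confusion-matrix counts of a binary classifier. The function name `binary_metrics` and the counts are hypothetical; they are not results from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)   # share of predicted CKD that is CKD
    recall = ratio(tp, tp + fn)      # share of CKD cases that were found
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),  # share of non-CKD correctly cleared
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for illustration only.
scores = binary_metrics(tp=95, fp=3, tn=97, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Precision falls as false positives rise and recall falls as false negatives rise, which is why the two are reported together with F1, their harmonic mean.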

To conclude, the feature selection algorithms in the proposed methodology effectively identified the prominent features based on their importance, and the pipeline performed well in diagnosing individuals at risk of developing CKD. Some of the classifiers proved effective in CKD prediction with the selected features, achieving higher accuracy, F1 score, precision, recall, AUC, and specificity together with lower bias. It can therefore be inferred that the proposed methodology, which combines these eight machine learning models with two efficient feature selection approaches, can detect people at risk of this nephrological condition earlier and identify their risk factors more accurately than conventional methods. This holds great promise for improving clinical decision-making and, ultimately, ensuring treatment for patients.

## Linked entities

- **Diseases:** chronic kidney disease (MONDO:0005300)

## Full-text entities

- **Diseases:** Chronic Kidney Disease (MESH:D051436), kidney illness (MESH:D007674), CKD (MESH:D012080), nephrological condition (MESH:D020763), of kidney function (MESH:D007680)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12991859/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12991859/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC12991859/full.md

---
Source: https://tomesphere.com/paper/PMC12991859