# Forest-EMCBE: an evolutionary ensemble learning algorithm for multiclass diagnosis of bacterial pneumonia using the CBC dataset

**Authors:** Yimin Shen, Xiaotian Xu, Xiaoxi Hao, Cuimin Sun, Wei Lan

PMC · DOI: 10.3389/fbinf.2026.1792643 · Frontiers in Bioinformatics · 2026-03-18

## TL;DR

This paper introduces Forest-EMCBE, an ensemble learning algorithm for multiclass diagnosis of bacterial pneumonia from complete blood count (CBC) data.

## Contribution

Forest-EMCBE combines a multi-objective genetic algorithm, error-correcting output codes (ECOC), and balanced sampling to handle class imbalance in multiclass medical datasets.
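To make the balanced sampling idea concrete, here is a minimal sketch of a class-balanced bootstrap draw, one common way to build bags for an ensemble on imbalanced data. It is not the paper's procedure: the function name `balanced_bootstrap`, the per-class sample size (the size of the smallest class), and the label counts are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return indices of a class-balanced bootstrap sample.

    Every class contributes the same number of indices, drawn with
    replacement; that number is the size of the smallest class
    (an assumption of this sketch, not taken from the paper).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n_per_class = min(len(idxs) for idxs in by_class.values())

    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=n_per_class))
    rng.shuffle(sample)
    return sample


# Hypothetical imbalanced labels with 4 classes (not the paper's data).
labels = [0] * 50 + [1] * 20 + [2] * 8 + [3] * 5
bag = balanced_bootstrap(labels, random.Random(0))

# Count how often each class appears in the bag.
counts = {c: sum(1 for i in bag if labels[i] == c) for c in set(labels)}
print(counts)
```

Each bag then gives its base classifier an equal view of every class, which is the usual motivation for pairing balanced sampling with bagging.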

## Key findings

- Forest-EMCBE outperformed 11 state-of-the-art algorithms on a CBC dataset with 1,457 samples and 4 pneumonia classes.
- The algorithm's three-layer structure improved classifier generalization in multiclass imbalanced medical data.
- Feature importance analysis revealed how age, gender, and neutrophil percentage impact predictions for different bacterial infections.

## Abstract

Rapid diagnosis of bacterial pneumonia is crucial for clinical treatment, but traditional diagnostic methods are time-consuming. The wide application of machine learning techniques in medical diagnosis offers an effective way to address this problem. However, the complexity of medical datasets and the problem of class imbalance pose serious challenges to classical machine learning algorithms.

To address the multiclass imbalance problem in complete blood count (CBC) datasets, this study proposes a novel ensemble learning algorithm, Forest of Evolutionary Multi-Classifiers Based on Bagging with Error-Correcting Output Coding (Forest-EMCBE). The algorithm integrates a multi-objective genetic algorithm, Error-Correcting Output Codes (ECOC), and a balanced sampling strategy, and it enhances the generalization ability of the classifiers through a three-layer integrated structure.
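As background on ECOC, the sketch below shows generic Hamming-distance decoding: each class is assigned a binary codeword, each column of the code matrix corresponds to one binary classifier, and a sample is assigned to the class whose codeword is closest to the vector of binary outputs. The 4-by-7 exhaustive code matrix and the helper names are assumptions for illustration; this summary does not describe the coding design that Forest-EMCBE actually uses.

```python
# Illustrative exhaustive code matrix: 4 classes (rows), 7 binary
# learners (columns). An assumption of this sketch, not the paper's matrix.
CODE_MATRIX = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
]


def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(outputs, code_matrix=CODE_MATRIX):
    """Return the index of the class whose codeword is nearest to `outputs`."""
    distances = [hamming(outputs, row) for row in code_matrix]
    return distances.index(min(distances))


def min_row_distance(code_matrix=CODE_MATRIX):
    """Smallest Hamming distance between any two codewords."""
    n = len(code_matrix)
    return min(
        hamming(code_matrix[i], code_matrix[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


# Codeword of class 2 with its first bit flipped, simulating one
# binary learner that made a mistake.
noisy = [1, 0, 1, 1, 0, 0, 1]
predicted = decode(noisy)
print(min_row_distance(), predicted)
```

With a minimum row distance of d, nearest-codeword decoding tolerates up to floor((d - 1) / 2) wrong binary outputs, which is where the "error-correcting" property comes from.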

To validate the effectiveness of the proposed method, we trained the diagnostic model on a CBC dataset containing 1,457 samples across 4 classes of bacterial pneumonia and compared it with 11 state-of-the-art algorithms. The experimental results show that Forest-EMCBE outperforms all compared algorithms on the CBC dataset.

Using Shapley value-based feature importance analysis, this study examines the contributions of key features to the prediction outcomes and further elucidates the differential impacts of features such as age, gender, and neutrophil percentage on predicting infections by different bacterial species.
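For readers unfamiliar with Shapley values, the following sketch computes them exactly for a three-feature toy model by averaging each feature's marginal contribution over all feature orderings. The model `toy_model`, the baseline, and the instance are invented for illustration and have no connection to the paper's classifier or data; practical tools approximate this computation because the number of orderings grows factorially.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating every feature ordering.

    Features absent from a coalition are held at their baseline value.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = instance[j]      # add feature j to the coalition
            new = predict(current)
            totals[j] += new - prev       # marginal contribution of j
            prev = new
    return [t / len(orderings) for t in totals]


def toy_model(x):
    """Hypothetical score with one interaction term (x[0] * x[2])."""
    return 2 * x[0] + x[1] + x[0] * x[2]


phi = shapley_values(toy_model, instance=[1, 1, 1], baseline=[0, 0, 0])
print(phi)
```

The attributions sum to the difference between the model output at the instance and at the baseline, and the interaction term is split evenly between the two features involved.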

## Linked entities

- **Diseases:** bacterial pneumonia (MONDO:0004652)

## Full-text entities

- **Diseases:** infections (MESH:D007239), bacterial pneumonia (MESH:D018410)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13039039/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13039039/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC13039039/full.md

---
Source: https://tomesphere.com/paper/PMC13039039