# Linear B-cell epitope prediction for SARS and COVID-19 vaccine design: Integrating balanced ensemble learning models and resampling strategies

**Authors:** Fatih Gurcan

PMC · DOI: 10.7717/peerj-cs.2970 · PeerJ Computer Science · 2025-06-18

## TL;DR

This paper introduces a machine learning framework to improve B-cell epitope prediction for SARS and COVID-19 vaccines by combining resampling techniques and ensemble learning.

## Contribution

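The paper integrates resampling strategies (over-sampling, under-sampling, and hybrid-sampling) with boosting, bagging, and balancing ensemble classifiers to improve B-cell epitope prediction. According to the abstract, the improvements were confirmed with paired t-test, Wilcoxon, and permutation tests.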

## Key findings

- The combination of SMOTE-ENN and ExtraTrees achieved the highest ROC AUC score of 0.9899.
- Instance Hardness Threshold (IHT) resampling combined with ExtraTrees followed closely, with an ROC AUC score of 0.9799.
- Peptides predicted with the highest confidence among correctly classified instances are put forward as potential epitope candidates for vaccine design.

## Abstract

This study presents a comprehensive machine learning framework to enhance the prediction accuracy of B-cell epitopes and antibody recognition related to Severe Acute Respiratory Syndrome (SARS) and Coronavirus Disease 2019 (COVID-19). To address the issue of data imbalance, various resampling techniques were applied using three types of strategies: over-sampling, under-sampling, and hybrid-sampling. The implemented resampling methods were designed to improve class balance and enhance model training. The rebalanced datasets were then used in model building with ensemble classifiers employing Boosting, Bagging, and Balancing strategies. Hyperparameter optimization for the classifiers was conducted using GridSearchCV, while feature selection was performed with the recursive feature elimination (RFE) algorithm. Model performance was evaluated using seven different metrics: Accuracy, Precision, Recall, F1-score, receiver operating characteristic area under the curve (ROC AUC), precision-recall area under the curve (PR AUC), and Matthews correlation coefficient (MCC). Furthermore, statistical significance analyses including paired t-test, Wilcoxon, and permutation tests confirmed the reliability of the model improvements. To interpret the model’s predictive behavior, peptides with the highest confidence among correctly classified instances were identified as potential epitope candidates. The results indicate that the combination of Synthetic Minority Over-Sampling Technique with Edited Nearest Neighbors (SMOTE-ENN) and ExtraTrees yielded the best performance, achieving an ROC AUC score of 0.9899. The combination of Instance Hardness Threshold (IHT) and ExtraTrees followed closely with a score of 0.9799. These findings emphasize the effectiveness of integrating resampling models and balancing ensemble classifiers in improving the accuracy of B-cell epitope prediction and antibody recognition for SARS and COVID-19 infections. This study contributes to vaccine development efforts and the advancement of immunoinformatics research by identifying promising epitope candidates.
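To make the evaluation step concrete, the sketch below computes the five threshold-based metrics named in the abstract (Accuracy, Precision, Recall, F1-score, MCC) from confusion-matrix counts, and runs an exact paired sign-flip permutation test on per-fold scores. ROC AUC and PR AUC are left out because they need ranked scores rather than counts. All numbers are hypothetical illustrations, not results from the paper, and the permutation test shown is one common variant; the abstract does not specify which the author used.

```python
from itertools import product
from math import sqrt


def threshold_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


def paired_permutation_p(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to be +d or -d.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


# Hypothetical imbalanced test set: 50 epitopes among 1000 peptides.
metrics = threshold_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Hypothetical per-fold ROC AUC for a resampled model vs. a baseline.
resampled = [0.98, 0.97, 0.99, 0.98, 0.99]
baseline = [0.95, 0.96, 0.97, 0.95, 0.96]
p_value = paired_permutation_p(resampled, baseline)
print("permutation p-value:", p_value)
```

With these made-up counts, accuracy is 0.98 while MCC is about 0.79, which illustrates why accuracy alone can flatter a classifier on imbalanced data. With five folds the smallest attainable two-sided p-value is 2/32 = 0.0625, so an exact permutation test over so few folds cannot reach the 0.05 level even when every fold favours one model.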

## Linked entities

- **Diseases:** Severe Acute Respiratory Syndrome (MONDO:0005091), Coronavirus Disease 2019 (MONDO:0100096)

## Full-text entities

- **Diseases:** SARS (MESH:D045169), COVID-19 (MESH:D000086382)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12193457/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12193457/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/PMC12193457/full.md

---
Source: https://tomesphere.com/paper/PMC12193457