# Predicting clinical outcomes in Helicobacter pylori-positive patients using supervised learning through the integration of demographic and genomic features

**Authors:** Venkatesh Narasimhan, Sreya Pulakkat Warrier, Jobin Jacob John, Monisha Priya T., Niriksha Varadaraj, Greeshma Grace Thomas, Balaji Veeraraghavan

PMC · DOI: 10.1186/s12876-025-04595-3 · BMC Gastroenterology · 2026-01-29

## TL;DR

This study uses machine learning to predict gastric cancer outcomes in H. pylori-infected patients by combining host demographics and bacterial genomic data.

## Contribution

The novel integration of host metadata and H. pylori genomic features improves gastric cancer prediction accuracy and identifies new risk factors.

## Key findings

- XGBoost and Random Forest models achieved AUROC values of 0.950–0.954 on the held-out test set, outperforming the logistic regression baseline (AUROC 0.830).
- Patient age was the strongest predictor of gastric cancer, with genomic features beyond known virulence genes also playing a role.
- Explainability methods like SHAP enhance model interpretability for clinical use.

## Abstract

Helicobacter pylori (H. pylori) infection is widespread globally and is linked to outcomes ranging from chronic gastritis to gastric cancer. However, only a minority of infected individuals progress to malignancy, influenced by a mix of bacterial, host, and environmental factors. Current predictive approaches are limited because they rely mainly on clinical and lifestyle data. Genomic approaches have been sparsely used, and incorporating them into machine learning models could support earlier and more personalized detection. This study aimed to evaluate the impact of integrating host metadata with genomic features from H. pylori to predict gastric cancer outcomes and identify associated variables.

A total of 1,363 publicly available H. pylori genomes with associated host information, dated between 1991 and 2024, were collected from NCBI and EnteroBase. Demographic features, virulence genes, and sequence-derived and variant-based features were extracted. Machine learning models were then developed to classify infection outcomes as gastric cancer or non-gastric cancer. They were trained using internal cross-validation folds within the training set, which comprised 80% of the dataset. Logistic regression, an interpretable baseline model, was compared against higher-performance ensemble models (XGBoost, Random Forest). Final model performance was assessed on the held-out test set using recall, precision, AUROC, and AUPRC.
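The evaluation design described above (stratified 80/20 split, cross-validation inside the training portion, baseline versus ensemble, held-out metrics) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the class balance, feature count, fold count, and hyperparameters are all hypothetical, and XGBoost is left out to keep the example dependency-free (an `xgboost.XGBClassifier` would slot into the same `models` dictionary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (NOT the study's data).
# Only the sample count (1,363) comes from the abstract; the rest is assumed.
X, y = make_classification(n_samples=1363, n_features=30, n_informative=8,
                           weights=[0.7, 0.3], random_state=0)

# Hold out 20% as the test set, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    # Interpretable baseline; scaling helps the solver converge.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    # Ensemble model with illustrative hyperparameters.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Cross-validation is run on the training portion only (fold count assumed).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    cv_auroc = cross_val_score(model, X_train, y_train,
                               cv=cv, scoring="roc_auc").mean()
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    pred = model.predict(X_test)
    results[name] = {
        "cv_auroc": cv_auroc,
        "recall": recall_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "auroc": roc_auc_score(y_test, proba),
        "auprc": average_precision_score(y_test, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Keeping the test set untouched until the final scoring step is what makes the held-out metrics an honest estimate; any tuning would happen inside the cross-validation loop.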

The logistic regression model achieved a recall of 0.737 (95% CI: 0.637–0.830) for gastric cancer and an AUROC of 0.830 (95% CI: 0.779–0.880). Both XGBoost and Random Forest models outperformed the baseline model with AUROC values ranging from 0.950 to 0.954 (95% CI: 0.904–0.976). Relative to the baseline, recall for gastric cancer detection improved by 8.14% for XGBoost (0.797, 95% CI: 0.711–0.877) and by 11.3% for Random Forest (0.820, 95% CI: 0.734–0.896). Across models, patient age consistently emerged as the strongest predictor of gastric cancer, with several sequence-derived genomic features beyond pre-established virulence genes contributing to the infection outcome differences.
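The recall improvements quoted above are relative gains over the baseline, not percentage-point differences. A short check using only the recall values reported in the abstract makes the distinction explicit; the function and variable names are illustrative.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Recall values for gastric cancer as reported in the abstract.
baseline_recall = 0.737   # logistic regression
xgboost_recall = 0.797
random_forest_recall = 0.820

xgb_gain = relative_improvement(baseline_recall, xgboost_recall)
rf_gain = relative_improvement(baseline_recall, random_forest_recall)

# Absolute differences, in percentage points, for comparison.
xgb_points = (xgboost_recall - baseline_recall) * 100
rf_points = (random_forest_recall - baseline_recall) * 100

print(f"XGBoost: +{xgb_gain:.2f}% relative, +{xgb_points:.1f} points")
print(f"Random Forest: +{rf_gain:.1f}% relative, +{rf_points:.1f} points")
```

The relative gains reproduce the reported 8.14% and 11.3%, which correspond to absolute recall gains of about 6.0 and 8.3 percentage points.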

This study demonstrates that combining pathogen genomics with host demographics uncovers novel risk factors and supports early detection with high predictive power. Explainability methods such as SHAP make the models more interpretable for clinical professionals and support informed decision-making. While internal validation showed strong performance, external validation on independent data and translation into clinical practice are still needed, using broader and more diverse datasets and additional host and lifestyle variables.
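To illustrate what a SHAP-style attribution reports, the sketch below computes Shapley values by hand for a logistic regression in log-odds space. For a linear model, and assuming features are treated as independent, the attribution of feature i for a sample x is w_i · (x_i − mean(x_i)). This is a simplified stand-in and not the study's analysis: the data are synthetic, the feature names are hypothetical, and the `shap` package is deliberately not used so the example stays self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical feature names; the data below are synthetic, not the study's.
feature_names = ["age", "feature_a", "feature_b"]
X = rng.normal(size=(500, 3))

# Synthetic labels driven mostly by the first column, weakly by the second.
true_logits = 2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

model = LogisticRegression().fit(X, y)
weights = model.coef_[0]
background_mean = X.mean(axis=0)

# Per-sample, per-feature attributions in log-odds units.
phi = (X - background_mean) * weights

# Efficiency property: attributions sum to the sample's margin minus the
# average margin over the background data.
margin = model.decision_function(X)
assert np.allclose(phi.sum(axis=1), margin - margin.mean())

# Global importance: mean absolute attribution per feature.
global_importance = np.abs(phi).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(global_importance)[::-1]]
print(ranking)
```

Tree ensembles such as Random Forest and XGBoost need a dedicated algorithm (for example TreeSHAP) rather than this closed form, but the outputs are read the same way: per-patient contributions that sum to the difference between the prediction and the average prediction.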

The online version contains supplementary material available at 10.1186/s12876-025-04595-3.

## Linked entities

- **Diseases:** gastric cancer (MONDO:0001056), chronic gastritis (MONDO:0005001)
- **Species:** Helicobacter pylori (taxon 210)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606], Helicobacter pylori (species) [taxon 210]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12922380/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12922380/full.md

## References

5 references — full list in the complete paper: https://tomesphere.com/paper/PMC12922380/full.md

---
Source: https://tomesphere.com/paper/PMC12922380