# EPheClass: ensemble-based phenotype classifier from 16S rRNA gene sequences

**Authors:** Lara Vázquez-González, Carlos Peña-Reyes, Alba Regueira-Iglesias, Carlos Balsa-Castro, Inmaculada Tomás, María J. Carreira

PMC · DOI: 10.3389/fbinf.2025.1514880 · 2025-09-30

## TL;DR

This paper introduces EPheClass, a machine learning pipeline for binary phenotype classification from 16S rRNA gene amplicon data in microbiome samples. It performs strongly on periodontal disease and inflammatory bowel disease, and competitively on antibiotic exposure.

## Contribution

The contribution is a classification pipeline for 16S rRNA count tables that combines feature selection with Dynamic Ensemble Selection (DES) and generalizes across phenotypes, study niches, and sample types.

## Key findings

- EPheClass achieved an F1 score of 0.913 and an AUC of 0.973 when diagnosing periodontal disease from saliva samples using only 13 features.
- For inflammatory bowel disease, the method outperformed previously published approaches evaluated on the same dataset.
- EPheClass showed competitive results in detecting antibiotic exposure, highlighting its generalizability.
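
As a quick consistency check on the periodontal result, the F1 score can be recomputed from the precision (0.881) and recall (0.947) reported in the abstract, since F1 is their harmonic mean. The helper name `f1_score` below is only illustrative and is not taken from the EPheClass code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for periodontal disease (saliva samples).
reported_precision = 0.881
reported_recall = 0.947

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # 0.913, matching the reported F1 score
```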

## Abstract

One area of bioinformatics that is currently attracting particular interest is the classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples. The microbial dysbiosis underlying these types of diseases is particularly challenging to classify, as the data are high-dimensional, with potentially hundreds or even thousands of predictive features. In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria were downloaded from public repositories. In the end, a total of 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although the margin over other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease with saliva samples, it used only 13 features to achieve an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. In addition, we used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset. We also evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. This highlights the importance and generalisation aspect of our classification approach, which is applicable to different phenotypes, study niches, and sample types. The code is available at https://gitlab.citius.usc.es/lara.vazquez/epheclass.

## Linked entities

- **Genes:** 16S rRNA (16S ribosomal RNA) [NCBI Gene 2597965]
- **Diseases:** periodontal disease (MONDO:0002635), inflammatory bowel disease (MONDO:0005265)

## Full-text entities

- **Diseases:** IBD (MESH:D015212), periodontal disease (MESH:D010510)
- **Methods:** DES-P (a Dynamic Ensemble Selection technique)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12518240/full.md

---
Source: https://tomesphere.com/paper/PMC12518240