# MetaFX: feature extraction from whole-genome metagenomic sequencing data

**Authors:** Artem Ivanov, Vladimir Popov, Maxim Morozov, Evgenii Olekhnovich, Vladimir Ulyantsev

PMC · DOI: 10.1093/bioinformatics/btag018 · 2026-01-20

## TL;DR

MetaFX is an open-source toolkit that extracts genomic features from whole-genome metagenomic sequencing data and uses them to classify groups of samples, improving disease prediction accuracy.

## Contribution

MetaFX introduces a novel reference-free feature extraction method for metagenomic classification.

## Key findings

- On 590 human gut samples with inflammatory bowel disease, MetaFX improves disease prediction accuracy by up to 17% over previous research.
- Classification results improve by 9±10% on average compared to taxonomic analysis.
- Features can be visualized and annotated for biological insights.

## Abstract

Microbial communities consist of thousands of microorganisms and viruses and are tightly connected with their environment; gut microbiota, for example, modulate host body metabolism. However, the direct relationship between the presence of certain microorganisms and the host state often remains unknown. Toolkits using reference-based approaches are limited to microbes present in databases. Reference-free methods often require enormous resources for metagenomic assembly or result in many poorly interpretable features based on k-mers.

Here we present MetaFX, an open-source library for feature extraction from whole-genome metagenomic sequencing data and classification of groups of samples. Using a large volume of metagenomic samples deposited in databases, MetaFX compares samples grouped by metadata criteria (e.g. disease or treatment) and constructs genomic features distinct for certain types of communities. Features are constructed based on statistical k-mer analysis and de Bruijn graph partitioning. Those features are used in machine learning models for classification of novel samples. Extracted features can be visualized on de Bruijn graphs and annotated to provide biological insights. We demonstrate the utility of MetaFX by building classification models for 590 human gut samples with inflammatory bowel disease. Our results outperform the disease prediction accuracy of previous research by up to 17% and improve classification results compared to taxonomic analysis by 9±10% on average.
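To make the idea of group-specific k-mer features more concrete, the following is a minimal toy sketch in Python. It is not MetaFX's implementation or API: the function names (`group_specific_kmers`, `components`, `feature_vector`), the `min_frac` threshold, the rule "absent from every other sample", and the toy reads are all assumptions made for illustration. It also ignores reverse complements and k-mer abundances. The sketch selects k-mers shared by one group of samples, groups them into connected components of a de Bruijn graph, and scores a new sample by the fraction of each component it contains.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads, k):
    """Union of k-mers over all reads of one sample."""
    result = set()
    for read in reads:
        result |= kmers(read, k)
    return result


def group_specific_kmers(samples, labels, target, k, min_frac=1.0):
    """k-mers present in at least min_frac of the target-group samples
    and absent from every other sample (an illustrative rule, assumed here)."""
    target_sets = [sample_kmers(r, k) for r, l in zip(samples, labels) if l == target]
    other_sets = [sample_kmers(r, k) for r, l in zip(samples, labels) if l != target]
    counts = defaultdict(int)
    for kmer_set in target_sets:
        for km in kmer_set:
            counts[km] += 1
    seen_elsewhere = set().union(*other_sets)
    threshold = min_frac * len(target_sets)
    return {km for km, c in counts.items() if c >= threshold and km not in seen_elsewhere}


def components(selected):
    """Connected components of the de Bruijn graph on the selected k-mers.
    Two k-mers are linked when the (k-1)-suffix of one is the (k-1)-prefix of the other."""
    by_prefix = defaultdict(list)
    for km in selected:
        by_prefix[km[:-1]].append(km)
    adj = {km: set() for km in selected}
    for km in selected:
        for nxt in by_prefix.get(km[1:], []):
            adj[km].add(nxt)
            adj[nxt].add(km)
    seen, comps = set(), []
    for start in sorted(selected):  # sorted for a deterministic component order
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.add(cur)
            stack.extend(adj[cur] - seen)
        comps.append(comp)
    return comps


def feature_vector(reads, comps, k):
    """One value per component: fraction of its k-mers found in the sample."""
    present = sample_kmers(reads, k)
    return [len(comp & present) / len(comp) for comp in comps]


# Toy data (hypothetical reads, not from the paper).
K = 4
samples = [["ACGTAC", "TTGGCC"], ["ACGTAC", "TTGGCA"], ["GGGCCC"], ["GGGCCA"]]
labels = ["ibd", "ibd", "healthy", "healthy"]

selected = group_specific_kmers(samples, labels, "ibd", K)
comps = components(selected)
print(comps)
print(feature_vector(["ACGTAC"], comps, K))
```

The resulting per-sample vectors could then be passed to any standard classifier. Grouping k-mers into graph components instead of using them individually keeps the number of features small and ties each feature to a contiguous genomic region, which is what makes visualization and annotation practical.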

MetaFX is a feature extraction toolkit applicable to metagenomic dataset analysis and sample classification. The source code, test data, and relevant information for MetaFX are freely accessible at https://github.com/ctlab/metafx under the MIT License. Alternatively, MetaFX can be obtained via http://doi.org/10.5281/zenodo.16949369.

## Linked entities

- **Diseases:** inflammatory bowel disease (MONDO:0005265)

## Full-text entities

- **Diseases:** inflammatory bowel disease (MESH:D015212)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12891910/full.md

---
Source: https://tomesphere.com/paper/PMC12891910