# SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

**Authors:** David W Ludwig, Christopher Guptil, Nicholas R Alexander, Kateryna Zhalnina, Edi M-L Wipf, Albina Khasanova, Nicholas A Barber, Wesley Swingley, Donald M Walker, Joshua L Phillips

PMC · DOI: 10.1093/bioinformatics/btaf370 · Bioinformatics · 2025-06-25

## TL;DR

SetBERT is a deep learning platform that improves analysis of high-throughput sequencing data by capturing microbial interactions and providing explainable predictions.

## Contribution

SetBERT introduces a pre-training methodology for HTS data that captures sequence interactions and provides explainable predictions.

## Key findings

- SetBERT achieves 95% genus-level accuracy in taxonomic classification.
- SetBERT autonomously explains predictions by identifying biologically relevant taxa.

## Abstract

High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss key interactions between microorganisms, significantly hindering our understanding of how these interactions influence microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which could inadvertently introduce protocol-specific bias.

Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models that process HTS data to produce contextualized embeddings and can be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification, with a genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological relevance of taxa identified by the model.
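To make the idea of a "contextualized embedding" concrete, the following is a minimal sketch, not the authors' architecture. It assumes each DNA sequence in a sample has already been encoded as a fixed-length vector, and it applies plain dot-product self-attention across the set with no learned projections and no positional encoding. The names `contextualize` and `sample` and the random input vectors are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contextualize(embeddings):
    """Self-attention over a set of per-sequence embeddings.

    Each output vector is a weighted average of all input vectors, so it
    depends on the whole sample rather than on one sequence alone.
    Assumption: no learned query/key/value projections (illustration only).
    """
    scale = math.sqrt(len(embeddings[0]))
    contextual = []
    for query in embeddings:
        # Attention weights of this sequence over every sequence in the sample.
        weights = softmax([dot(query, key) / scale for key in embeddings])
        mixed = [
            sum(w * vec[d] for w, vec in zip(weights, embeddings))
            for d in range(len(query))
        ]
        contextual.append(mixed)
    return contextual


# Hypothetical sample: 5 sequences, each already encoded as a 4-dim vector.
rng = random.Random(0)
sample = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
context = contextualize(sample)

# Reordering the input sequences only reorders the outputs.
order = [3, 0, 4, 1, 2]
shuffled = contextualize([sample[i] for i in order])
```

Because no positional information is used, the output for a given sequence is the same wherever that sequence appears in the input, which fits the unordered nature of HTS reads. A practical model would add learned projections, multiple attention heads, and stacked layers.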

All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin whose source code is available at https://github.com/DLii-Research/q2-deepdna.

## Full-text entities

- **Diseases:** MDS (MESH:C538175), cancerous (MESH:D009369), SFD (MESH:D009181)
- **Chemicals:** compounds (-), carbon (MESH:D002244), polymer (MESH:D011108)
- **Species:** Streptomyces (genus) [taxon 1883], Homo sapiens (human, species) [taxon 9606], Actinomycetota (actinobacteria, phylum) [taxon 201174], Pseudomonadota (proteobacteria, phylum) [taxon 1224], Bacillota (clostridial firmicutes, phylum) [taxon 1239]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12245400/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12245400/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12245400/full.md

---
Source: https://tomesphere.com/paper/PMC12245400