# ICCTax: a hierarchical taxonomic classifier for metagenomic sequences on a large language model

**Authors:** Yichun Gao, Jiaxing Bai, Feng Zhou, Yushuang He, Ying Wang, Xiaobing Huang

PMC · DOI: 10.1093/bioadv/vbaf257 · Bioinformatics Advances · 2025-10-15

## TL;DR

ICCTax is a taxonomic classifier built on the large language model HyenaDNA. It assigns metagenomic sequences to phyla and genera across Archaea, Bacteria, Eukaryotes, and Viruses, and is evaluated on data from diverse environments.

## Contribution

ICCTax combines the large language model HyenaDNA with complementary-view-based hierarchical metric learning and a hierarchical-level compactness loss to classify genomic sequences hierarchically.

## Key findings

- ICCTax outperforms baseline methods, especially on out-of-distribution data.
- It accurately classifies sequences to 155 genera and 43 phyla across four superkingdoms.
- It performs strongly on Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets.

## Abstract

Metagenomic data increasingly reflect the coexistence of species from Archaea, Bacteria, Eukaryotes, and Viruses in complex environments. Taxonomic classification across the four superkingdoms is essential for understanding microbial communities, exploring genomic evolutionary relationships, and identifying novel species. This task is inherently imbalanced, uneven, and hierarchical. Genomic sequences provide crucial information for taxonomy classification, but many existing methods relying on sequence similarity to reference genomes often leave sequences misclassified due to incomplete or absent reference databases. Large language models offer a novel approach to extract intrinsic characteristics from sequences.

We present ICCTax, a classifier integrating the large language model HyenaDNA with complementary-view-based hierarchical metric learning and hierarchical-level compactness loss to identify taxonomic genomic sequences. ICCTax accurately classifies sequences to 155 genera and 43 phyla across the four superkingdoms, including unseen taxa. Across three datasets built with different strategies, ICCTax outperforms baseline methods, particularly on Out-of-Distribution data. On Simulated Marine Metagenomic Communities datasets from three oceanic sites, DairyDB-16S rRNA, Tara Oceans, and wastewater metagenomic datasets, it demonstrates strong performance, showcasing real-world applicability. ICCTax can further support identification of novel species and functional genes across diverse environments, enhancing understanding of microbial ecology.
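The abstract names a hierarchical-level compactness loss but does not define it in this summary view. As a rough intuition only, the sketch below shows one generic way such a term could be computed: at each taxonomic level, measure how tightly embeddings cluster around the centroid of their own class, then combine the levels with weights. The function names, the centroid-based formulation, the level weights, and the toy 2-D embeddings are all assumptions made for illustration; they are not taken from the paper or from the ICCTax code.

```python
from collections import defaultdict


def level_compactness(embeddings, labels):
    """Mean squared distance from each embedding to its own class centroid.

    Illustrative assumption: the paper's actual loss may be defined differently.
    """
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)

    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        # Centroid of all embeddings sharing this label.
        centroid = [sum(v[i] for v in members) / len(members) for i in range(dim)]
        for v in members:
            total += sum((v[i] - centroid[i]) ** 2 for i in range(dim))
    return total / len(embeddings)


def hierarchical_compactness(embeddings, labels_by_level, weights):
    """Weighted sum of per-level compactness (weights are hypothetical)."""
    return sum(
        weights[level] * level_compactness(embeddings, labels)
        for level, labels in labels_by_level.items()
    )


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for sequence representations.
    emb = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
    labels = {
        "genus": ["Escherichia", "Escherichia", "Bacillus", "Bacillus"],
        "phylum": ["Pseudomonadota", "Pseudomonadota", "Bacillota", "Bacillota"],
    }
    print(hierarchical_compactness(emb, labels, {"genus": 1.0, "phylum": 0.5}))
```

A smaller value means embeddings sit closer to their class centroids at every level. In a real training setup a term like this would be computed on model embeddings with a differentiable framework and combined with a classification or metric-learning objective; the pure-Python version here only illustrates the arithmetic.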

Code is available at https://github.com/Ying-Lab/ICCTax.

## Linked entities

- **Species:** Archaea (taxon 2157), Bacteria (taxon 2), Viruses (taxon 10239)

## Full-text entities

- **Diseases:** ID (MESH:D020243)
- **Chemicals:** nitrogen (MESH:D009584), carbon (MESH:D002244), AveP (-)
- **Species:** Pseudomonadota (proteobacteria, phylum) [taxon 1224], Homo sapiens (human, species) [taxon 9606], Bacillota (clostridial firmicutes, phylum) [taxon 1239], PX clade (clade) [taxon 569578]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12619997/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12619997/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12619997/full.md

---
Source: https://tomesphere.com/paper/PMC12619997