# PharaCon: a new framework for identifying bacteriophages via conditional representation learning

**Authors:** Zeheng Bai, Yao-zhong Zhang, Yuxuan Pang, Seiya Imoto

PMC · DOI: 10.1093/bioinformatics/btaf085 · 2025-02-24

## TL;DR

PharaCon is a conditional BERT framework that identifies bacteriophages in metagenomic data by incorporating label information during both pre-training and fine-tuning.

## Contribution

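A conditional BERT framework that attaches label classes as special tokens during tokenization, introducing label constraints at pre-training, paired with a fine-tuning scheme that lets the conditional model serve as a phage classifier.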

## Key findings

- PharaCon outperforms existing methods in identifying bacteriophages from mixed metagenomic sequences.
- Conditional BERT pre-training with label-specific representations enhances model performance and efficiency.
- The framework targets the information bias that arises from the imbalance between bacterial and phage samples in mixed pre-training data.

## Abstract

Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples.

To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model’s input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning. We named the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon’s effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning.
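The label-attachment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token names `[PHAGE]` and `[BAC]`, the k-mer size of 6, the position of the label token right after `[CLS]`, and the choice to exempt special tokens from masking are all assumptions made for illustration. The repository linked below is the authoritative reference.

```python
import random
from typing import List

# Hypothetical label tokens; the abstract does not give the actual names.
LABEL_TOKENS = {"phage": "[PHAGE]", "bacteria": "[BAC]"}
SPECIAL_TOKENS = set(LABEL_TOKENS.values()) | {"[CLS]", "[SEP]"}


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def conditional_tokenize(seq: str, label: str, k: int = 6) -> List[str]:
    """Attach the class label as a special token at tokenization time.

    Assumed layout: [CLS], label token, k-mers, [SEP].
    """
    return ["[CLS]", LABEL_TOKENS[label]] + kmer_tokenize(seq, k) + ["[SEP]"]


def mask_kmers(tokens: List[str], mask_prob: float = 0.15, seed: int = 0) -> List[str]:
    """Mask k-mer tokens for masked-language-model pre-training.

    Special tokens (including the label token) are never masked here, so the
    label stays visible as a condition on the prediction. That is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    return [
        "[MASK]" if tok not in SPECIAL_TOKENS and rng.random() < mask_prob else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    tokens = conditional_tokenize("ACGTACGT", "phage")
    print(tokens)
    print(mask_kmers(tokens, mask_prob=0.5))
```

Under these assumptions, the same sequence tokenized with the `phage` and `bacteria` labels differs only in the label token, which is what lets a single model learn label-specific contextual representations from mixed data.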

The source code and associated data can be accessed at https://github.com/Celestial-Bai/PharaCon.

## Full-text entities

- **Diseases:** CHVD (MESH:D001734)
- **Chemicals:** nucleotide (MESH:D009711), BAC (-)
- **Species:** Homo sapiens (human, species) [taxon 9606], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11928753/full.md

---
Source: https://tomesphere.com/paper/PMC11928753