PharaCon: a new framework for identifying bacteriophages via conditional representation learning
Zeheng Bai, Yao-zhong Zhang, Yuxuan Pang, Seiya Imoto

TL;DR
PharaCon is a conditional BERT framework that improves the identification of bacteriophages in metagenomic data by incorporating label information during pre-training and fine-tuning.
Contribution
The novel conditional BERT framework introduces label constraints during pre-training and fine-tuning for improved phage classification.
Findings
PharaCon outperforms existing methods in identifying bacteriophages from mixed metagenomic sequences.
Conditional BERT pre-training with label-specific representations enhances model performance and efficiency.
The framework effectively handles label imbalance in bacterial and phage data during training.
Abstract
Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples. To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model’s input. Additionally, we introduced a new fine-tuning…
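The label-as-special-token idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token names `[PHAGE]` and `[BACTERIA]`, the overlapping k-mer tokenization with k = 6, the placement of the label token directly after `[CLS]`, and the helper names `kmer_tokenize` and `conditional_tokenize` are all assumptions made for illustration.

```python
from typing import List

# Hypothetical label tokens. The abstract only states that label classes
# are attached as special tokens; these names are assumptions.
LABEL_TOKENS = {"phage": "[PHAGE]", "bacteria": "[BACTERIA]"}


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    # Sequences shorter than k yield an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def conditional_tokenize(seq: str, label: str, k: int = 6) -> List[str]:
    """Build a BERT-style token list with the class label attached.

    The label token is placed right after [CLS] (an assumed position),
    so the label constraint is part of the model's input.
    """
    if label not in LABEL_TOKENS:
        raise ValueError(f"unknown label: {label!r}")
    return ["[CLS]", LABEL_TOKENS[label]] + kmer_tokenize(seq, k) + ["[SEP]"]


tokens = conditional_tokenize("ATGCGTAC", "phage", k=6)
print(tokens)
```

Under these assumptions, a masked-language-modeling objective would see the class token alongside the k-mers of every training sequence, which is one plausible way to make the learned representations label-specific.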
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Bacteriophages and microbial interactions
