TL;DR
This paper introduces PhaMer, a Transformer-based tool that improves the accuracy of bacteriophage identification from metagenomic data by leveraging protein organization and self-attention mechanisms.
Contribution
It presents a novel Transformer-based approach with protein-cluster vocabulary for more accurate phage detection in complex metagenomic datasets.
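As a rough illustration of what a protein-cluster vocabulary could look like, the sketch below turns a contig's proteins, in genome order, into a fixed-length sequence of integer tokens that a Transformer could consume. It is not PhaMer's code: the function names, the cluster labels, the reserved `PAD`/`UNK` ids, and the padding scheme are all assumptions made for illustration. The assignment of each protein to a cluster is assumed to have happened upstream.

```python
from typing import Dict, List

# Reserved token ids (an assumed convention, not taken from the paper).
PAD, UNK = 0, 1


def build_vocab(cluster_ids: List[str]) -> Dict[str, int]:
    """Give each distinct protein cluster an integer token after the reserved ids."""
    return {cid: i + 2 for i, cid in enumerate(sorted(set(cluster_ids)))}


def encode_contig(protein_clusters: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map a contig's proteins (in genome order) to a padded token 'sentence'."""
    # Proteins whose cluster is not in the vocabulary become UNK.
    tokens = [vocab.get(c, UNK) for c in protein_clusters[:max_len]]
    # Right-pad so every contig yields a sequence of the same length.
    return tokens + [PAD] * (max_len - len(tokens))


# Hypothetical cluster labels, used only for this example.
vocab = build_vocab(["PC_terminase", "PC_capsid", "PC_tail"])
sentence = encode_contig(["PC_capsid", "PC_unknown", "PC_tail"], vocab, max_len=5)
print(sentence)  # [2, 1, 3, 0, 0]
```

Keeping the tokens in genome order preserves the protein organization that the self-attention layers are meant to exploit.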
Findings
PhaMer outperforms existing tools on multiple datasets.
Improves the F1-score of phage detection by 27% on real data.
Effective on short contigs and simulated datasets.
Abstract
Motivation: Bacteriophages are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from various microbiomes, has become a popular means of new phage discovery. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data. Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Multi-Head Attention · Softmax · Absolute Position Encodings · Byte Pair Encoding · Residual Connection · Layer Normalization
