PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang

TL;DR
PathoLM is a novel genome foundation model that significantly improves pathogen identification from DNA sequences, especially for novel and divergent pathogens, by leveraging pre-trained DNA models and minimal fine-tuning.
Contribution
The paper introduces PathoLM, a pre-trained DNA language model that enhances pathogen detection and classification with minimal data and outperforms existing methods.
Findings
PathoLM outperforms existing models like DciPatho in zero-shot and few-shot scenarios.
PathoLM demonstrates superior accuracy in identifying diverse bacterial and viral pathogens.
The expanded PathoLM-Sp model shows improved performance in classifying ESKAPEE species.
Abstract
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It…
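The idea of reusing a pre-trained sequence encoder and fitting only a small classifier on limited labeled data can be illustrated with a minimal sketch. This is not PathoLM's implementation: the `embed` function below is a stand-in (normalized 3-mer frequencies) for a real pre-trained encoder such as the Nucleotide Transformer, the four training sequences are synthetic toy data rather than real pathogen genomes, and all function names are hypothetical.

```python
import itertools
import math

K = 3
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def embed(seq):
    """Stand-in for a frozen pre-trained encoder (assumption):
    returns the normalized K-mer frequency vector of a DNA sequence."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])  # skip k-mers with ambiguous bases
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, lr=1.0, epochs=500):
    """Fit a logistic-regression head with batch gradient descent.
    Only the head is trained; the embeddings stay fixed."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(seq, w, b):
    """Probability that a sequence belongs to the positive class."""
    x = embed(seq)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic toy data: label 1 = "pathogenic", label 0 = "non-pathogenic".
# The labels are arbitrary and carry no biological meaning.
train_seqs = ["GCGCGCGCGCGC", "GGCCGGCCGGCC", "ATATATATATAT", "AATTAATTAATT"]
train_labels = [1, 1, 0, 0]

w, b = train_head([embed(s) for s in train_seqs], train_labels)
p_pos = predict_proba("GCGGCCGCGGCC", w, b)
p_neg = predict_proba("ATTAATATTAAT", w, b)
```

In a realistic setting, `embed` would be replaced by the pooled hidden states of a pre-trained DNA language model, and the encoder weights could also be updated during fine-tuning rather than kept frozen.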
Taxonomy
Topics: Plant Disease Resistance and Genetics · Microbial infections and disease research
Methods: Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout
