LA4SR: illuminating the dark proteome with generative AI
David R. Nelson, Ashish Kumar Jaiswal, Noha Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani

TL;DR
This paper introduces LA4SR, a generative AI approach to microbial protein classification that extends to the uncharacterized dark proteome, reaching F1 scores of up to 95 while running 16,580x faster than BLASTP.
Contribution
The study re-engineers open-source language models for microbial sequence classification, demonstrates high accuracy and robustness to incomplete sequences, and provides interpretability tools for biological insight.
Findings
Achieved F1 scores of up to 95 in microbial sequence classification.
Models operated 16,580x faster than BLASTP, at 2.9x its recall.
Larger (>1B-parameter) models reached high accuracy (F1 > 86) when trained on less than 2% of the available data.
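For readers unfamiliar with the metric, the F1 figures above are the harmonic mean of precision and recall, reported here on a 0-100 scale. The following is a minimal sketch of that calculation for a two-class case; the function name `prf_percent` and the labels `"algal"` and `"bacterial"` are hypothetical illustrations, not taken from the paper, and this is not the authors' evaluation code.

```python
def prf_percent(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one class, each scaled to 0-100.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: all three metrics are defined as zero here.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative.
y_true = ["algal", "algal", "algal", "bacterial", "bacterial"]
y_pred = ["algal", "algal", "bacterial", "algal", "bacterial"]
precision, recall, f1 = prf_percent(y_true, y_pred, positive="algal")
print(f"precision={precision:.1f} recall={recall:.1f} F1={f1:.1f}")
```

Because F1 combines both error types, a classifier cannot reach a high score by maximizing recall alone, which is why the recall comparison with BLASTP is reported separately.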
Abstract
AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification. The models achieved F1 scores up to 95 and operated 16,580x faster and at 2.9x the recall of BLASTP. They effectively classified the algal dark proteome - uncharacterized proteins comprising about 65% of total proteins - validated on new data including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (>1B) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved when training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software…
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Bioinformatics · Cell Image Analysis Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Attention Dropout · Multi-Head Attention · Residual Connection · Softmax · Weight Decay
