ProtiGeno: a prokaryotic short gene finder using protein language models
Tony Tu, Gautham Krishna, Amirali Aghazadeh

TL;DR
ProtiGeno is a deep learning-based tool that improves the detection of short prokaryotic genes (<180 nts) by leveraging a protein language model trained on millions of evolved proteins. In the authors' experiments it outperforms current state-of-the-art gene finders on short genes.
Contribution
The paper introduces a deep learning method designed specifically for short prokaryotic gene prediction using a protein language model, addressing the reduced sensitivity of current gene finders on short genes.
Findings
ProtiGeno achieves higher accuracy and recall on short genes than current state-of-the-art gene finders in experiments on 4,288 prokaryotic genomes.
It effectively predicts both coding and noncoding short genes.
The model's features are interpretable through structural visualization.
Abstract
Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function with applications in medicine and biotechnology. Although the current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably in finding shorter genes (<180 nts). The culprit is insufficient annotated gene data to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and possible…
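To make the "short ORF" notion in the abstract concrete, the following is a minimal sketch of how candidate short ORFs (<180 nts) could be enumerated from a DNA sequence before any classifier scores them. It is not ProtiGeno's implementation: the function name `find_short_orfs`, the `min_len` cutoff, and the choice of start codons (ATG, GTG, TTG, as commonly used for bacterial genomes) are assumptions made for illustration only.

```python
# Illustrative sketch only; not the authors' pipeline.
# Assumed: start codons ATG/GTG/TTG and stop codons TAA/TAG/TGA.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_short_orfs(seq, max_len=180, min_len=30):
    """Enumerate candidate ORFs with min_len <= length < max_len (in nts).

    Scans all three frames on both strands. For each stop codon, only the
    first in-frame start codon upstream is kept (the longest candidate).
    Returns tuples (strand, start, end, orf_sequence); coordinates are
    0-based, end-exclusive, and relative to the scanned strand.
    `min_len` is a hypothetical lower cutoff, not a value from the paper.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start  # includes the stop codon
                    if min_len <= length < max_len:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # close the candidate and keep scanning
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 10 + "TAA"  # a 36-nt toy ORF
    for strand, begin, end, orf in find_short_orfs(toy):
        print(strand, begin, end, len(orf))
```

In a gene finder of the kind the abstract describes, each candidate would then be translated to amino acids and scored by a model; that scoring step is outside the scope of this sketch.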
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
