Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome
Ami G. Sangster, Cameron Dufault, Haoning Qu, Denise Le, Julie D. Forman-Kay, Alan M. Moses

TL;DR
This paper introduces a new method to identify and categorize functional regions in proteins using a language model without training, outperforming existing tools and revealing new biological insights.
Contribution
A zero-shot segmentation method using ProtT5 embeddings to identify and categorize protein segments without training or fine-tuning.
Findings
Boundary predictions from Zero-shot Protein Segmentation (ZPS) reproduce reviewed UniProt annotations for the human proteome better than established bioinformatics tools.
ProtT5 embeddings of ZPS segments can categorize over 200 common UniProt annotations, including domains and disordered regions.
ZPS identifies unannotated functional regions like mitochondrion targeting signals and SYGQ-rich prion-like domains.
Abstract
The biological function of a protein is often determined by its distinct functional units, such as folded domains and intrinsically disordered regions. Identifying and categorizing these protein segments from sequence has been a major focus in computational biology, and this work has enabled the automatic annotation of folded protein domains. Here we show that embeddings from the unsupervised protein language model ProtT5 can be used to identify and categorize protein segments without relying on conserved patterns in primary amino acid sequence. We present Zero-shot Protein Segmentation (ZPS), where we use embeddings from ProtT5 to predict the boundaries of protein segments without training or fine-tuning any parameters. We find that ZPS boundary predictions for the human proteome are better at reproducing reviewed annotations from UniProt than established bioinformatics tools and ProtT5…
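To make the idea of training-free boundary prediction concrete, the following is a minimal sketch and not the authors' ZPS algorithm, whose details are not given in this summary. It assumes a per-residue embedding matrix is already available (here replaced by synthetic data) and scores each position by the cosine distance between the mean embeddings of the windows to its left and right. The function names, window size, threshold, and minimum gap are all hypothetical choices made for illustration.

```python
import numpy as np


def boundary_scores(emb, window=5):
    """Score each position by the cosine distance between the mean
    embedding of the `window` residues before it and after it.
    (Illustrative heuristic, not the published ZPS procedure.)"""
    n = emb.shape[0]
    scores = np.zeros(n)
    for i in range(window, n - window + 1):
        left = emb[i - window:i].mean(axis=0)
        right = emb[i:i + window].mean(axis=0)
        denom = np.linalg.norm(left) * np.linalg.norm(right) + 1e-12
        scores[i] = 1.0 - (left @ right) / denom
    return scores


def pick_boundaries(scores, threshold=0.5, min_gap=5):
    """Greedily keep the highest-scoring positions above `threshold`
    that are at least `min_gap` residues apart."""
    chosen = []
    for i in np.argsort(scores)[::-1]:
        if scores[i] < threshold:
            break
        if all(abs(int(i) - j) >= min_gap for j in chosen):
            chosen.append(int(i))
    return sorted(chosen)


# Synthetic stand-in for per-residue embeddings (an assumption):
# three segments of lengths 30, 20 and 40, each clustered around
# its own direction in a 16-dimensional space.
rng = np.random.default_rng(0)
dim = 16
centers = np.eye(3, dim)
lengths = [30, 20, 40]
emb = np.vstack([
    center + 0.05 * rng.normal(size=(length, dim))
    for center, length in zip(centers, lengths)
])

scores = boundary_scores(emb)
boundaries = pick_boundaries(scores)
print(boundaries)
```

On this synthetic input the sketch places boundaries at positions 30 and 50, the first residues of the second and third segments. No parameters are fitted at any point, which is the sense in which such an approach is zero-shot; with real language-model embeddings the window size and threshold would need to be chosen with care.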
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Bioinformatics and Genomic Networks
