# Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome

**Authors:** Ami G. Sangster, Cameron Dufault, Haoning Qu, Denise Le, Julie D. Forman-Kay, Alan M. Moses

PMC · DOI: 10.1371/journal.pcbi.1012929 · 2025-11-11

## TL;DR

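This paper introduces Zero-shot Protein Segmentation (ZPS), which uses embeddings from the ProtT5 protein language model to identify and categorize functional regions in proteins without any training or fine-tuning. Its boundary predictions reproduce reviewed UniProt annotations for the human proteome better than established bioinformatics tools, and its segment embeddings reveal unannotated regions such as mitochondrion targeting signals and SYGQ-rich prion-like domains.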

## Contribution

Zero-shot Protein Segmentation (ZPS): a method that uses ProtT5 embeddings to identify and categorize protein segments without training or fine-tuning any parameters.

## Key findings

- ZPS boundary predictions for the human proteome reproduce reviewed UniProt annotations better than established bioinformatics tools.
- ProtT5 embeddings of ZPS segments can categorize over 200 of the most common UniProt annotations in the human proteome, including folded domains, sub-domains, and intrinsically disordered regions.
- Segment embeddings identify unannotated functional regions within intrinsically disordered regions, such as mitochondrion targeting signals and SYGQ-rich prion-like domains.
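
To make the idea of categorizing segments by their embeddings concrete, here is a minimal sketch. It is not the authors' procedure: the mean-pooling step, the nearest-reference rule, the function names, and the synthetic 16-dimensional vectors are all assumptions made for illustration. Real ProtT5 per-residue embeddings would replace the synthetic array.

```python
import numpy as np

def segment_embedding(residue_emb, start, end):
    """Mean-pool per-residue embeddings over the half-open range [start, end)."""
    return residue_emb[start:end].mean(axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_label(query, references):
    """Return the label whose reference embedding is most similar to the query."""
    return max(references, key=lambda label: cosine_similarity(query, references[label]))

# Synthetic stand-in for per-residue embeddings (assumption, not real ProtT5 output):
# 30 residues near one direction followed by 20 residues near an orthogonal one.
rng = np.random.default_rng(0)
dim = 16
dir_a = np.zeros(dim)
dir_a[0] = 1.0
dir_b = np.zeros(dim)
dir_b[1] = 1.0
residue_emb = np.vstack([
    dir_a + 0.05 * rng.normal(size=(30, dim)),
    dir_b + 0.05 * rng.normal(size=(20, dim)),
])

# Hypothetical reference embeddings, one per annotation category.
references = {"type_A": dir_a, "type_B": dir_b}

segments = [(0, 30), (30, 50)]
labels = [nearest_label(segment_embedding(residue_emb, s, e), references)
          for s, e in segments]
print(labels)
```

In practice the reference vectors would presumably come from segments that already carry UniProt annotations, so an unannotated segment can be compared against them without fitting any parameters.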

## Abstract

The biological function of a protein is often determined by its distinct functional units, such as folded domains and intrinsically disordered regions. Identifying and categorizing these protein segments from sequence has been a major focus in computational biology, which has enabled the automatic annotation of folded protein domains. Here we show that embeddings from the unsupervised protein language model ProtT5 can be used to identify and categorize protein segments without relying on conserved patterns in primary amino acid sequence. We present Zero-shot Protein Segmentation (ZPS), where we use embeddings from ProtT5 to predict the boundaries of protein segments without training or fine-tuning any parameters. We find that ZPS boundary predictions for the human proteome are better at reproducing reviewed annotations from UniProt than established bioinformatics tools, and that ProtT5 embeddings of ZPS segments can categorize over 200 of the most common UniProt annotations in the human proteome, including folded domains, sub-domains, and intrinsically disordered regions. To explore ZPS predictions, we introduce a new way to visualize protein embeddings that closely resembles diagrams of distinct functional units in protein biology. Since ZPS and segment embeddings can be used without training or fine-tuning, the approach is not biased towards known annotations and can be used to identify and categorize unannotated protein segments. We used the segment embeddings to identify unannotated mitochondrion targeting signals and SYGQ-rich prion-like domains, which are functional regions within intrinsically disordered regions. We expect that the analysis of protein segment embedding similarity can lead to valuable information about protein function, including about intrinsically disordered regions and poorly understood protein regions.
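
The abstract does not spell out how boundaries are derived from the embeddings, so the following is only a sketch of one generic training-free approach, not the ZPS algorithm itself. It assumes a sliding-window comparison: each position is scored by the cosine distance between the mean embedding of the residues just before it and just after it, and peaks in that score are reported as boundaries. The window size, threshold, function names, and synthetic embeddings are all illustrative assumptions.

```python
import numpy as np

def boundary_scores(emb, window=5):
    """Score position i by the cosine distance between the mean embedding
    of residues [i - window, i) and the mean embedding of [i, i + window)."""
    n = emb.shape[0]
    scores = np.zeros(n)
    for i in range(window, n - window + 1):
        left = emb[i - window:i].mean(axis=0)
        right = emb[i:i + window].mean(axis=0)
        cos = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
        scores[i] = 1.0 - cos
    return scores

def pick_boundaries(scores, threshold=0.5):
    """Keep positions that are local maxima of the score and exceed the threshold."""
    return [
        i for i in range(1, len(scores) - 1)
        if scores[i] > threshold
        and scores[i] >= scores[i - 1]
        and scores[i] > scores[i + 1]
    ]

# Synthetic stand-in for per-residue embeddings (assumption, not real ProtT5 output):
# residues 0..39 resemble one direction, residues 40..69 an orthogonal one.
rng = np.random.default_rng(1)
dim = 16
dir_a = np.zeros(dim)
dir_a[0] = 1.0
dir_b = np.zeros(dim)
dir_b[1] = 1.0
emb = np.vstack([
    dir_a + 0.05 * rng.normal(size=(40, dim)),
    dir_b + 0.05 * rng.normal(size=(30, dim)),
])

boundaries = pick_boundaries(boundary_scores(emb))
# Turn boundary positions into half-open segments covering the whole sequence.
edges = [0] + boundaries + [emb.shape[0]]
segments = list(zip(edges[:-1], edges[1:]))
print(boundaries, segments)
```

Nothing in this sketch is fitted to annotated data, which is the property the abstract emphasizes: the only inputs are the embeddings and two hand-set constants.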

Understanding protein function has been a major focus of computational biology for decades. Classical approaches have used amino acid sequences and compositional biases that are conserved over evolution to identify protein segments that are associated with specific biological functions. Our results put forward a new approach for identifying protein segments which can be associated with specific biological functions. This approach applies zero-shot segmentation to protein language model embeddings and does not require any training or fine-tuning, so it has the potential to generalize to rare and poorly annotated protein segments. We also present a new approach for visualizing protein language model embeddings using colours to indicate the similarity of protein segments in the embedding space.
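
One way to picture the colour-based visualization is to map each segment embedding to an RGB triple so that similar segments receive similar colours. The sketch below does this with a three-component PCA followed by a shared rescaling to the range 0 to 1. This mapping is an assumption chosen for illustration; the summary does not say how the authors assign colours, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def embeddings_to_rgb(segment_embs):
    """Project segment embeddings onto their first three principal axes and
    rescale all coordinates with one shared range so values lie in [0, 1].
    Assumes at least 3 segments and an embedding dimension of at least 3."""
    centered = segment_embs - segment_embs.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:3].T
    lo, hi = coords.min(), coords.max()
    span = hi - lo if hi > lo else 1.0  # guard against identical embeddings
    return (coords - lo) / span

# Synthetic stand-in for four segment embeddings (assumption):
# segments 0 and 1 are similar to each other, as are segments 2 and 3.
rng = np.random.default_rng(2)
dim = 8
base_a = np.zeros(dim)
base_a[0] = 1.0
base_b = np.zeros(dim)
base_b[1] = 1.0
segment_embs = np.vstack([
    base_a + 0.01 * rng.normal(size=dim),
    base_a + 0.01 * rng.normal(size=dim),
    base_b + 0.01 * rng.normal(size=dim),
    base_b + 0.01 * rng.normal(size=dim),
])

rgb = embeddings_to_rgb(segment_embs)
print(np.round(rgb, 2))
```

A shared range is used instead of per-channel scaling so that channels carrying only noise are not stretched to full contrast, which would give near-identical segments visibly different colours.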

## Linked entities

- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12617893/full.md

---
Source: https://tomesphere.com/paper/PMC12617893