Data augmentation enables label-specific generation of homologous protein sequences

Lorenzo Rosset; Martin Weigt; Francesco Zamponi

arXiv:2507.15651 · q-bio.QM · July 22, 2025

TL;DR

The paper introduces a semi-supervised method that combines protein language model embeddings with a generative model to annotate function and generate label-specific sequences within homologous protein families, where labeled sequences are scarce.

Contribution

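A two-stage approach: embeddings from a pretrained protein language model are first used to infer functional annotations from few labeled examples, and an annotation-aware Restricted Boltzmann Machine trained on those inferred annotations then generates sequences with prescribed functional labels.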

Findings

01. High annotation accuracy across protein families

02. Generation of functionally coherent synthetic sequences

03. Effective use of limited labeled data

Abstract

Accurately annotating and controlling protein function from sequence data remains a major challenge, particularly within homologous families where annotated sequences are scarce and structural variation is minimal. We present a two-stage approach for semi-supervised functional annotation and conditional sequence generation in protein families using representation learning. First, we demonstrate that protein language models, pretrained on large and diverse sequence datasets and possibly finetuned via contrastive learning, provide embeddings that robustly capture fine-grained functional specificities, even with limited labeled data. Second, we use the inferred annotations to train a generative probabilistic model, an annotation-aware Restricted Boltzmann Machine, capable of producing synthetic sequences with prescribed functional labels. Across several protein families, we show that this…
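The first stage described in the abstract (inferring labels for unannotated sequences from a handful of labeled ones, using embeddings) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the embeddings are synthetic Gaussian vectors standing in for protein language model outputs, the nearest-centroid rule with cosine similarity is a simple stand-in classifier rather than the authors' method, and the names `infer_labels`, `labeled_idx`, and `n_classes` are hypothetical.

```python
import numpy as np


def infer_labels(embeddings, labeled_idx, labels, n_classes):
    """Assign every sequence the label of its nearest class centroid.

    embeddings : (n_sequences, dim) array, one row per sequence
    labeled_idx: indices of the few sequences with a known label
    labels     : known labels, aligned with labeled_idx
    n_classes  : number of functional classes (each needs >= 1 labeled example)
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # One centroid per class, computed from the labeled sequences only.
    centroids = np.stack(
        [z[labeled_idx[labels == c]].mean(axis=0) for c in range(n_classes)]
    )
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    # Pick the class whose centroid is most similar to each embedding.
    return (z @ centroids.T).argmax(axis=1)


# Synthetic stand-in for embeddings (assumption): two well-separated
# Gaussian clusters in 16 dimensions, 100 sequences per class.
rng = np.random.default_rng(0)
n_per_class, dim = 100, 16
means = np.zeros((2, dim))
means[0, 0] = 8.0
means[1, 1] = 8.0
true_labels = np.repeat([0, 1], n_per_class)
embeddings = means[true_labels] + rng.normal(size=(2 * n_per_class, dim))

# Only five sequences per class are treated as annotated.
labeled_idx = np.array([0, 1, 2, 3, 4, 100, 101, 102, 103, 104])
predicted = infer_labels(embeddings, labeled_idx, true_labels[labeled_idx], 2)
accuracy = (predicted == true_labels).mean()
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

In the setting of the paper, the inferred labels would then serve as training annotations for the generative model of the second stage. That stage is not sketched here, since the abstract does not specify how the annotation-aware Restricted Boltzmann Machine is parameterized.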
