Data augmentation enables label-specific generation of homologous protein sequences
Lorenzo Rosset, Martin Weigt, Francesco Zamponi

TL;DR
This paper introduces a semi-supervised method that combines protein language model embeddings with a generative model to annotate function and generate label-specific sequences in homologous protein families where labeled data are scarce.
Contribution
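It presents a two-stage approach: embeddings from pretrained protein language models, optionally fine-tuned via contrastive learning, are used to annotate sequences from few labels, and the inferred annotations then train an annotation-aware Restricted Boltzmann Machine that generates sequences with prescribed functional labels.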
Findings
High annotation accuracy across protein families
Generation of functionally coherent synthetic sequences
Effective use of limited labeled data
Abstract
Accurately annotating and controlling protein function from sequence data remains a major challenge, particularly within homologous families where annotated sequences are scarce and structural variation is minimal. We present a two-stage approach for semi-supervised functional annotation and conditional sequence generation in protein families using representation learning. First, we demonstrate that protein language models, pretrained on large and diverse sequence datasets and possibly finetuned via contrastive learning, provide embeddings that robustly capture fine-grained functional specificities, even with limited labeled data. Second, we use the inferred annotations to train a generative probabilistic model, an annotation-aware Restricted Boltzmann Machine, capable of producing synthetic sequences with prescribed functional labels. Across several protein families, we show that this…
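As a rough illustration of the first stage described in the abstract, the sketch below assigns functional labels to unlabeled sequences from a handful of labeled ones, using only their embeddings. Everything in it is an assumption made for illustration: the embeddings are synthetic stand-ins for protein language model outputs, the classifier is a plain nearest-centroid rule with cosine similarity (the paper's actual classifier is not specified here), and the name `propagate_labels` is hypothetical.

```python
import numpy as np


def propagate_labels(embeddings, labels, n_classes):
    """Label every sequence from a few labeled ones (nearest centroid).

    embeddings: (n, d) array, one row per sequence. Here a stand-in for
        protein language model embeddings (assumption).
    labels: (n,) int array; -1 marks an unlabeled sequence. Each class
        needs at least one labeled sequence.
    Returns an (n,) int array; annotations that were given are kept.
    """
    # L2-normalize so that dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # One centroid per class, averaged over that class's labeled rows.
    centroids = np.stack(
        [unit[labels == c].mean(axis=0) for c in range(n_classes)]
    )
    predicted = np.argmax(unit @ centroids.T, axis=1)
    # Trust the provided annotations where they exist.
    return np.where(labels >= 0, labels, predicted)


# Synthetic toy data: two well-separated functional classes in 8 dimensions.
rng = np.random.default_rng(0)
true = np.repeat([0, 1], 20)
centers = np.zeros((2, 8))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
emb = centers[true] + 0.5 * rng.normal(size=(40, 8))

# Only two sequences per class carry a label; the rest are unlabeled (-1).
labels = np.full(40, -1)
labels[[0, 1, 20, 21]] = true[[0, 1, 20, 21]]

pred = propagate_labels(emb, labels, n_classes=2)
print("fraction correctly annotated:", (pred == true).mean())
```

In the pipeline the abstract outlines, labels inferred this way would then serve as the conditioning variable for training the generative model; that second stage is not sketched here.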
