Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability
Ana F. Rodrigues, Lucas Ferraz, Laura Balbi, Pedro Giesteira Cotovio, Catia Pesquita

TL;DR
This study systematically evaluates protein sequence embeddings for bioengineering, revealing that fine-tuning and the level of sequence variation significantly impact predictive performance.
Contribution
It provides a comprehensive comparison of ProtBERT and ESM2 embeddings, highlighting the importance of fine-tuning and mutation scope in protein design tasks.
Findings
Amino acid-level embeddings outperform sequence-level embeddings in supervised tasks before fine-tuning.
Sequence-level embeddings perform better than amino acid-level embeddings in unsupervised settings.
Fine-tuning with task-specific labels yields the best predictive performance.
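To make the distinction between the two embedding granularities concrete, here is a minimal sketch of how mean pooling into a sequence-level vector can dilute a localized mutation, while amino acid-level vectors keep it intact. It is an illustration under strong assumptions, not the authors' pipeline: `toy_residue_embeddings` is a hypothetical stand-in that assigns one fixed random vector per amino acid type (real ProtBERT or ESM2 embeddings are contextual, so a mutation also shifts neighbouring positions), and the sequence, mutation position, and window are made up.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_residue_embeddings(sequence, dim=8, seed=0):
    """Hypothetical stand-in for a protein language model.

    Returns one fixed random vector per amino acid type (no context),
    with shape (len(sequence), dim). Real models are contextual.
    """
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(AMINO_ACIDS), dim))
    idx = [AMINO_ACIDS.index(aa) for aa in sequence]
    return table[idx]


def sequence_level(residue_emb):
    """Sequence-level representation: mean over all residue positions."""
    return residue_emb.mean(axis=0)


def region_level(residue_emb, start, end):
    """Amino acid-level representation: concatenated vectors of a window."""
    return residue_emb[start:end].reshape(-1)


# Made-up 100-residue sequence with a single substitution at position 42.
wild_type = "MKTAYIAKQR" * 10
pos = 42
mutant = wild_type[:pos] + "W" + wild_type[pos + 1:]

wt = toy_residue_embeddings(wild_type)
mt = toy_residue_embeddings(mutant)

# How far each representation moves when one residue changes.
residue_shift = np.linalg.norm(mt[pos] - wt[pos])
pooled_shift = np.linalg.norm(sequence_level(mt) - sequence_level(wt))
region_shift = np.linalg.norm(
    region_level(mt, 40, 45) - region_level(wt, 40, 45)
)

print(f"shift at mutated residue:      {residue_shift:.4f}")
print(f"shift after mean pooling:      {pooled_shift:.4f}")
print(f"shift in 5-residue window:     {region_shift:.4f}")
```

In this toy setting the pooled shift is the per-residue shift divided by the sequence length, which is one way to picture why few or localized mutations can be hard to read from a sequence-level vector. It says nothing about the relative performance of the real models, which depends on context and training.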
Abstract
Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of…
