# Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

**Authors:** Ana F. Rodrigues, Lucas Ferraz, Laura Balbi, Pedro Giesteira Cotovio, Catia Pesquita

PMC · DOI: 10.1038/s41598-026-45458-5 · Scientific Reports · 2026-03-26

## TL;DR

This study evaluates how well pre-trained protein sequence embeddings (ProtBERT and ESM2 variants) predict the viability of AAV vectors, and finds that the best performance is reached only after fine-tuning the embeddings with task-specific labels.

## Contribution

The paper provides a systematic comparison of multiple ProtBERT and ESM2 embedding variants for AAV capsid design, and shows why fine-tuning is needed for datasets whose mutations are sparse or highly localized.

## Key findings

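- Before fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised prediction tasks.
- Global sequence-level embeddings tend to be more effective in unsupervised settings.
- Optimal performance is reached only when embeddings are fine-tuned with task-specific labels; after fine-tuning, sequence-level representations perform best.
- The amount of sequence variation needed to noticeably shift sequence representations exceeds what bioengineering studies typically explore, which is why fine-tuning matters for datasets with sparse or highly localized mutations.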
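One intuition behind these findings is that pooling per-residue vectors into a single sequence-level vector can dilute a localized change. The toy sketch below illustrates this with random vectors standing in for real embeddings; it is not the authors' pipeline and does not call ProtBERT or ESM2. The sequence length, embedding dimension, and mutated window are illustrative assumptions, as is the simplification that vectors outside the mutated window stay unchanged (real contextual models would shift them slightly).

```python
import math
import random


def mean_pool(residue_vectors):
    """Average per-residue vectors into one sequence-level vector."""
    n = len(residue_vectors)
    return [sum(column) / n for column in zip(*residue_vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_vector(rng, dim):
    """Stand-in for one residue embedding (assumption: i.i.d. Gaussian)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def flatten(rows):
    """Concatenate per-residue vectors into one long feature vector."""
    return [x for row in rows for x in row]


rng = random.Random(0)
SEQ_LEN, DIM = 735, 128    # illustrative sizes, not taken from the paper
WINDOW = range(560, 588)   # hypothetical 28-residue mutated region

# "Wild-type" per-residue vectors, and a variant that differs only in WINDOW.
wild_type = [random_vector(rng, DIM) for _ in range(SEQ_LEN)]
variant = [row[:] for row in wild_type]
for i in WINDOW:
    variant[i] = random_vector(rng, DIM)

# Sequence-level view: mean-pool over all residues, then compare.
pooled_sim = cosine_similarity(mean_pool(wild_type), mean_pool(variant))

# Amino acid-level view: compare only the vectors inside the mutated window.
local_sim = cosine_similarity(
    flatten(wild_type[i] for i in WINDOW),
    flatten(variant[i] for i in WINDOW),
)

print(f"pooled (sequence-level) similarity: {pooled_sim:.3f}")
print(f"window (amino acid-level) similarity: {local_sim:.3f}")
```

Under these assumptions the pooled vectors stay highly similar even though every residue in the window was replaced, while the window-level vectors are close to uncorrelated. That is consistent with residue-level features carrying more usable signal for a supervised model before fine-tuning, though the sketch says nothing about how real pre-trained models behave.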

## Abstract

Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid–level embeddings outperform sequence-level representations in supervised predictive tasks, whereas global sequence-level embeddings tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performances. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.

The online version contains supplementary material available at 10.1038/s41598-026-45458-5.

## Full-text entities

- **Genes:** S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}
- **Diseases:** cancer (MESH:D009369)
- **Chemicals:** acid (MESH:D000143), amino acid (MESH:D000596)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Adeno-associated virus (species) [taxon 272636], adeno-associated virus 2 (no rank) [taxon 10804], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13039942/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13039942/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC13039942/full.md

---
Source: https://tomesphere.com/paper/PMC13039942