Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction
Ruben Weitzman, Peter Mørch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Marks, Pascal Notin

TL;DR
Protriever is a novel end-to-end differentiable framework for protein homolog retrieval that improves fitness prediction accuracy and efficiency over traditional MSA-based methods, enabling scalable and adaptable protein modeling.
Contribution
The paper introduces Protriever, a flexible, task-agnostic, end-to-end differentiable approach that replaces costly MSA-based homolog retrieval with efficient vector search.
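To make the idea of retrieving homologs by vector search concrete, the following is a minimal sketch, not Protriever's actual implementation. It assumes a toy `embed` function (normalized 2-mer counts) as a stand-in for a learned sequence encoder, and a brute-force cosine-similarity scan in a hypothetical `retrieve` helper in place of an approximate nearest-neighbor index. It does not show the differentiable, jointly trained part of the method; it only illustrates why lookup reduces to a matrix-vector product once sequences are embedded.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def embed(seq: str, k: int = 2) -> np.ndarray:
    """Toy stand-in for a learned encoder (assumption): L2-normalized k-mer counts."""
    vec = np.zeros(len(AMINO_ACIDS) ** k)
    for i in range(len(seq) - k + 1):
        idx = 0
        for aa in seq[i:i + k]:
            if aa not in AA_INDEX:
                break  # skip k-mers containing non-standard residues
            idx = idx * len(AMINO_ACIDS) + AA_INDEX[aa]
        else:
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: str, database: list[str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k database sequences ranked by cosine similarity to the query."""
    db = np.stack([embed(s) for s in database])  # shape: (n_sequences, dim)
    scores = db @ embed(query)                   # cosine similarity, since rows are unit-norm
    order = np.argsort(-scores)[:top_k]
    return [(database[i], float(scores[i])) for i in order]


# Hypothetical example sequences, for illustration only.
database = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG", "WWPLLSHHEE"]
hits = retrieve("MKTAYIAKQR", database, top_k=2)
for seq, score in hits:
    print(seq, round(score, 3))
```

In a real system the embeddings would come from a trained model and be precomputed once for the whole database, so each query costs a single encoder pass plus an index lookup rather than an alignment search.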
Findings
Achieves state-of-the-art protein fitness prediction performance.
Operates two orders of magnitude faster than MSA-based methods.
Flexible and adaptable to different retrieval strategies and databases.
Abstract
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSAs), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertion and deletion patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Protein Structure and Dynamics · Bioinformatics and Genomic Networks
