Fast uncovering of protein sequence diversity from structure

Luca Alessandro Silva; Barthelemy Meynard-Piganeau; Carlo Lucibello; Christoph Feinauer

arXiv:2406.11975·q-bio.QM·May 30, 2025·ICLR

TL;DR

InvMSAFold is a fast inverse folding method that generates highly diverse protein sequences from structure, enabling efficient high-throughput virtual screening and protein design.

Contribution

The paper introduces InvMSAFold, a novel inverse folding approach that significantly improves sampling speed and diversity in protein sequence generation from structure.

Findings

01

Increased sequence diversity leads to greater biochemical variability.

02

Sampling is orders of magnitude faster than with existing methods.

03

Potential for high-throughput virtual screening and protein design.

Abstract

We present InvMSAFold, an inverse folding method for generating protein sequences that is optimized for diversity and speed. For a given structure, InvMSAFold generates the parameters of a probability distribution over the space of sequences with pairwise interactions, capturing the amino acid covariances observed in Multiple Sequence Alignments (MSA) of homologous proteins. This allows for the efficient generation of highly diverse protein sequences while preserving structural and functional integrity. We show that this increased diversity in sampled sequences translates into greater variability in biochemical properties, highlighting the exciting potential of our method for applications such as protein design. The orders of magnitude improvement in sampling speed compared to existing methods unlocks new possibilities for high-throughput virtual screening.
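To make the phrase "probability distribution over the space of sequences with pairwise interactions" concrete, the following is a minimal sketch of a generic pairwise (Potts-style) sequence model with Gibbs sampling. It is not the authors' implementation: the parameters here are random rather than predicted from a structure, the alphabet size `Q = 21` is an assumption, and the function names and the particular low-rank factorization of the couplings are hypothetical choices made for illustration.

```python
import numpy as np

Q = 21  # assumed alphabet size: 20 amino acids plus a gap symbol


def random_low_rank_potts(length, rank, rng):
    """Build illustrative random fields h and low-rank couplings J (hypothetical)."""
    h = rng.normal(scale=0.1, size=(length, Q))
    # Assumed factorization: J[i, j, a, b] = sum_k V[k, i, a] * V[k, j, b]
    V = rng.normal(scale=0.1, size=(rank, length, Q))
    J = np.einsum("kia,kjb->ijab", V, V)
    # Zero the self-couplings so each position only interacts with the others
    idx = np.arange(length)
    J[idx, idx] = 0.0
    return h, J


def log_score(seq, h, J):
    """Unnormalized log-probability: sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j)."""
    pos = np.arange(len(seq))
    fields = h[pos, seq].sum()
    pair = J[pos[:, None], pos[None, :], seq[:, None], seq[None, :]]
    return fields + np.triu(pair, k=1).sum()


def gibbs_sample(h, J, n_sweeps, rng):
    """Draw one sequence by repeatedly resampling each position given the rest."""
    length = h.shape[0]
    pos = np.arange(length)
    seq = rng.integers(Q, size=length)
    for _ in range(n_sweeps):
        for i in range(length):
            # Row j of the indexed array is J[i, j, :, seq[j]]; summing over j
            # gives the coupling contribution for every candidate symbol at i.
            logits = h[i] + J[i][pos, :, seq].sum(axis=0)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            seq[i] = rng.choice(Q, p=probs)
    return seq


rng = np.random.default_rng(0)
h, J = random_low_rank_potts(length=8, rank=2, rng=rng)
sample = gibbs_sample(h, J, n_sweeps=5, rng=rng)
score = log_score(sample, h, J)
```

Once `h` and `J` are fixed, drawing further sequences only requires cheap array lookups, with no additional neural network evaluation. The need to run a Markov chain for such a pairwise model is also what Reviewer 02 refers to when contrasting the PW variant (which requires MCMC) with the AR variant (which does not).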

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* An interesting idea to approach the inverse folding problem (i.e. the problem of generating sequences that fold into a given structure).
* Proposes a low-rank approximation of the couplings and fields of the lightweight sequence model.
* Fast generation of sequences that fit well to a given structure.

Weaknesses

* The idea of generating a Potts model has already been proposed by Li et al. (2023).

Reviewer 02 · Rating 8 · Confidence 3

Strengths

InvMSAFold samples much faster than ESM-IF1 or ProteinMPNN, which is important when you want to generate millions of models, as I think could be reasonable for virtual screening/protein design applications. InvMSAFold seems able to sample more diverse regions of potential protein structure/function space than ESM-IF1; again, this is important when you are trying to select for particular properties (substrate specificity, thermostability). That InvMSAFold is able to capture resid

Weaknesses

There is no specific example taken through to the conclusion that the model preserves "structural and functional integrity". Functional integrity is what you want when you're designing new proteins/doing virtual screening. The authors should consider including such an example or clarifying this statement, since it is a major claim of their paper. I was not clear on InvMSAFold-AR/-PW. I understand that PW requires MCMC and AR does not, but I wonder whether there are cases/tasks in which a PW vs

Reviewer 03 · Rating 5 · Confidence 5

Strengths

In their computational experiments, the authors demonstrated that the sequences generated by their models not only fold into the target structure but also exhibit greater diversity and more effectively capture the correlations between residues at different sites. Furthermore, they showed that this sequence diversity extends to other properties, such as predicted solubility and predicted thermostability. Overall, this paper represents a new methodological advancement.

Weaknesses

1. The authors only compare their method with ESM-IF1, and do not compare their method with other state-of-the-art inverse folding methods.
2. In many places such as in section 1, "ESM-IF" was wrongly typed as "ESM-1F". This may lead readers to perceive the authors as lacking expertise.
3. The article contains too many grammatical errors.

Videos

Fast Uncovering of Protein Sequence Diversity from Structure · SlidesLive

Taxonomy

Topics: Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research · RNA and protein synthesis mechanisms

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training