Fast uncovering of protein sequence diversity from structure
Luca Alessandro Silva, Barthelemy Meynard-Piganeau, Carlo Lucibello, Christoph Feinauer

TL;DR
InvMSAFold is a fast inverse folding method that generates highly diverse protein sequences from structure, enabling efficient high-throughput virtual screening and protein design.
Contribution
The paper introduces InvMSAFold, a novel inverse folding approach that significantly improves sampling speed and diversity in protein sequence generation from structure.
Findings
Increased sequence diversity leads to greater biochemical variability.
Sampling is orders of magnitude faster than with existing methods.
Potential for high-throughput virtual screening and protein design.
Abstract
We present InvMSAFold, an inverse folding method for generating protein sequences that is optimized for diversity and speed. For a given structure, InvMSAFold generates the parameters of a probability distribution over the space of sequences with pairwise interactions, capturing the amino acid covariances observed in Multiple Sequence Alignments (MSA) of homologous proteins. This allows for the efficient generation of highly diverse protein sequences while preserving structural and functional integrity. We show that this increased diversity in sampled sequences translates into greater variability in biochemical properties, highlighting the exciting potential of our method for applications such as protein design. The orders of magnitude improvement in sampling speed compared to existing methods unlocks new possibilities for high-throughput virtual screening.
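To make the "probability distribution over the space of sequences with pairwise interactions" concrete, the following is a minimal sketch of a Potts-style model with fields and low-rank couplings, sampled with Gibbs MCMC. It is an illustration under stated assumptions, not the authors' implementation: the alphabet size `Q = 21`, the factorization `J = V Vᵀ`, the random parameters (which stand in for what a structure-conditioned network would output), and all function names are hypothetical.

```python
import numpy as np

Q = 21  # assumed alphabet: 20 amino acids + gap


def make_low_rank_potts(L, K, rng):
    """Hypothetical parameters: fields h (L, Q) and rank-K factors V (L, Q, K)."""
    h = rng.normal(scale=0.1, size=(L, Q))
    V = rng.normal(scale=0.1, size=(L, Q, K))
    return h, V


def couplings(V):
    """Assumed low-rank form: J[i, a, j, b] = sum_k V[i, a, k] * V[j, b, k]."""
    J = np.einsum("iak,jbk->iajb", V, V)
    for i in range(V.shape[0]):
        J[i, :, i, :] = 0.0  # no self-couplings
    return J


def energy(seq, h, J):
    """E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, s_i, j, s_j]."""
    idx = np.arange(len(seq))
    field = h[idx, seq].sum()
    pair = J[idx[:, None], seq[:, None], idx[None, :], seq[None, :]]  # (L, L)
    return -(field + 0.5 * pair.sum())  # J is symmetric, so halve the full sum


def gibbs_sample(h, J, n_sweeps, rng):
    """Resample one site at a time from its conditional p(s_i | rest)."""
    L, q = h.shape
    seq = rng.integers(q, size=L)
    idx = np.arange(L)
    for _ in range(n_sweeps):
        for i in range(L):
            # log-weight of each amino acid a at site i given all other sites
            logits = h[i] + J[i][:, idx, seq].sum(axis=1)
            p = np.exp(logits - logits.max())
            p /= p.sum()
            seq[i] = rng.choice(q, p=p)
    return seq


rng = np.random.default_rng(0)
h, V = make_low_rank_potts(L=6, K=3, rng=rng)
J = couplings(V)
sample = gibbs_sample(h, J, n_sweeps=5, rng=rng)
```

Storing `V` instead of the full coupling tensor reduces the parameter count from order L²Q² to order LQK, which is one plausible reason a low-rank form makes such a model lightweight.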
Peer Reviews
Decision: ICLR 2025 Spotlight
* An interesting idea to approach the inverse folding problem (i.e. the problem of generating sequences that fold into a given structure).
* Proposes a low-rank approximation of the couplings and fields of the lightweight sequence model.
* Fast generation of sequences that fit well to a given structure.
* The idea of generating a Potts model has already been proposed by Li et al. (2023).
InvMSAFold samples much faster than ESM-IF1 or ProteinMPNN. This is important when you want to generate millions of models, as I think could be reasonable for virtual screening/protein design applications. InvMSAFold seems able to sample more diverse regions of potential protein structure/function space than ESM-IF1. Again, this is important when you are trying to select for particular properties (substrate specificity, thermostability). That InvMSAFold is able to capture resid
There is not a specific example taken through to the conclusion that the model preserves "structural and functional integrity". Functional integrity is what you want when you're designing new proteins/doing virtual screening. The authors should consider including such an example or clarifying this statement since that is a major claim of their paper. I was not clear on the InvMSAFold-AR/-PW. I understand that PW requires MCMC and AR does not but I wonder are there cases/tasks in which a PW vs
In their computational experiments, the authors demonstrated that the sequences generated by their models not only fold into the target structure but also exhibit greater diversity and more effectively capture the correlations between residues at different sites. Furthermore, they showed that this sequence diversity extends to other properties, such as predicted solubility and predicted thermostability. Overall, this paper represents a new methodological advancement.
1. The authors only compare their method with ESM-IF1, and do not compare their method with other state-of-the-art inverse folding methods.
2. In many places such as in section 1, "ESM-IF" was wrongly typed as "ESM-1F". This may lead readers to perceive the authors as lacking expertise.
3. The article contains too many grammatical errors.
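One reviewer's remark that the PW variant requires MCMC while the AR variant does not can be illustrated with a small sketch of ancestral sampling. It assumes an autoregressive pairwise form in which each site is conditioned only on earlier sites, p(s_i | s_<i) ∝ exp(h_i(s_i) + Σ_{j<i} J_ij(s_i, s_j)); this form, the random parameters, and the function name are assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np


def sample_autoregressive(h, J, rng):
    """Draw one sequence in a single left-to-right pass (no MCMC).

    h: (L, q) fields; J: (L, q, L, q) couplings, only entries with j < i are used.
    Returns the sequence and its exact log-probability under the model.
    """
    L, q = h.shape
    seq = np.zeros(L, dtype=int)
    log_prob = 0.0
    for i in range(L):
        logits = h[i].copy()
        for j in range(i):
            logits += J[i, :, j, seq[j]]  # contribution of already-sampled site j
        logits -= logits.max()  # numerical stability
        logp = logits - np.log(np.exp(logits).sum())  # normalized log-probabilities
        seq[i] = rng.choice(q, p=np.exp(logp))
        log_prob += logp[seq[i]]
    return seq, log_prob


# Hypothetical random parameters, standing in for model outputs
rng = np.random.default_rng(0)
L, q = 8, 21
h = rng.normal(scale=0.1, size=(L, q))
J = rng.normal(scale=0.1, size=(L, q, L, q))
seq, logp = sample_autoregressive(h, J, rng)
```

Because every conditional is normalized over only q states, each sample is exact and its likelihood is available in closed form, whereas a fully pairwise model has an intractable normalizing constant and is typically sampled with MCMC.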
Taxonomy
Topics: Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research · RNA and protein synthesis mechanisms
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training
