Pairing interacting protein sequences using masked language modeling
Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol

TL;DR
This paper introduces DiffPALM, a method that uses protein language models to pair interacting protein sequences, which in turn improves deep-learning structure prediction of protein complexes.
Contribution
The paper presents DiffPALM, a differentiable approach that exploits MSA Transformer to pair protein sequences. It outperforms existing coevolution-based pairing methods on challenging benchmarks and requires no fine-tuning of the language model.
Findings
DiffPALM outperforms existing coevolution-based pairing methods.
Using DiffPALM-paired sequences improves the accuracy of AlphaFold-Multimer structure predictions for protein complexes.
The method works well even with shallow multiple sequence alignments.
Abstract
Predicting which proteins interact together from amino-acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called DiffPALM that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids. We show that it captures inter-chain coevolution, while it was trained on single-chain data, which means that it can be used out-of-distribution. Relying on MSA Transformer without…
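The abstract describes formulating the pairing of paralogs "in a differentiable way". The following is a minimal sketch of one common way to relax a permutation so that it becomes differentiable: Sinkhorn normalization of a score matrix into an approximately doubly stochastic "soft permutation". Whether this matches the paper's exact formulation is an assumption here. The names `sinkhorn`, `scores`, and `temperature` are illustrative, and the toy scores are hand-written. In the method described above, the signal would come from MSA Transformer's masked-token predictions, which this sketch does not compute.

```python
import math
from itertools import permutations


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, temperature=0.5, n_iters=100):
    """Relax a square score matrix into an approximately doubly stochastic matrix.

    Alternately normalizes rows and columns in log space. A lower temperature
    gives a matrix closer to a hard permutation.
    """
    n = len(scores)
    log_p = [[s / temperature for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1 after this step.
        for i in range(n):
            lse = logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        # Column normalization: each column sums to 1 after this step.
        for j in range(n):
            lse = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_p]


# Hypothetical toy scores: entry [i][j] is how compatible paralog i of
# family A is with paralog j of family B (higher means more compatible).
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 0.1, 1.5],
    [0.4, 1.8, 0.2],
]

soft = sinkhorn(scores)

# Harden the soft matrix into a one-to-one pairing. Brute force over all
# permutations is only viable for tiny examples like this one.
n = len(soft)
pairing = max(
    permutations(range(n)),
    key=lambda perm: sum(soft[i][perm[i]] for i in range(n)),
)
print(pairing)  # (0, 2, 1)
```

Because every step of `sinkhorn` is a smooth function of the input scores, gradients of a downstream loss can in principle flow back to the parameters that produce those scores; the final hard pairing is only extracted at the end. For realistic family sizes, the brute-force step would be replaced by a proper assignment solver.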
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
