Protein language models trained on multiple sequence alignments learn phylogenetic relationships
Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol

TL;DR
This paper shows that protein language models trained on multiple sequence alignments (MSAs) learn phylogenetic relationships between sequences, and that they can separate functional and structural signals from correlations caused by shared evolutionary history.
Contribution
It shows that MSA Transformer encodes phylogenetic relationships in its column attentions, and that unsupervised contact prediction with MSA Transformer is more robust to phylogenetic noise than contact prediction with Potts models.
Findings
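Simple combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in an MSA.
MSA-based language models can separate coevolutionary signals from phylogenetic correlations.
Unsupervised contact prediction with MSA Transformer is more resilient to phylogenetic noise than prediction with Potts models.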
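As a minimal sketch of the quantity the first finding refers to, the Python snippet below computes normalised pairwise Hamming distances between the rows of an alignment. The function names and the toy alignment are hypothetical, and counting a gap character as an ordinary symbol is an assumption made here for illustration, not necessarily the paper's convention.

```python
from itertools import combinations


def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Assumption: a gap ('-') is treated like any other symbol.
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def pairwise_hamming(msa: list[str]) -> list[list[float]]:
    """Symmetric matrix of normalised Hamming distances between MSA rows."""
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(msa[i], msa[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy alignment: three sequences, six columns.
toy_msa = ["MKV-LA", "MKVGLA", "MRV-IA"]
distances = pairwise_hamming(toy_msa)
for row in distances:
    print(["%.3f" % value for value in row])
```

A matrix like this is the target that column-attention-derived scores would be compared against. Extracting those attentions from the model is not shown here.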
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical…
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Protein Structure and Dynamics
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Multi-Head Attention
