From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2
Kieran D. Lamb, Joseph Hughes, Spyros Lytras, Francesca Young, Orges Koci, James C. Herzig, Simon C. Lovell, Joe Grove, Ke Yuan, David L. Robertson

TL;DR
This paper shows that the protein language model ESM-2 can predict the effects of mutations in SARS-CoV-2 proteins without needing multiple sequence alignments.
Contribution
The study demonstrates that ESM-2 can capture evolutionary constraints and variant effects directly from single sequences.
Findings
ESM-2 captures evolutionary constraints from single sequences, recapitulating what normally requires multiple sequence alignment (MSA) data.
ESM-2 representations encode evolutionary history and distinguish variants of concern based on receptor binding and antigenicity.
ESM-2 likelihoods identify epistatic interactions among sites in the protein.
Abstract
Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSA). The concepts of grammar and semantics from natural language have been suggested to have the potential to capture functional properties of proteins. Here, we investigate how these representations enable assessment of variation due to mutation. Applied to the SARS-CoV-2 spike protein via in silico deep mutational scanning (DMS), the PLM ESM-2 captures evolutionary constraints directly from sequence context, recapitulating what normally requires MSA data. Unlike other state-of-the-art methods which require protein structures or multiple sequences for training, we show what can be accomplished using an unmodified pretrained PLM. Applied to SARS-CoV-2 variants across the pandemic, we demonstrate that ESM-2…
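As a rough illustration of what in silico deep mutational scanning with a PLM involves, the sketch below scores every single amino acid substitution in a sequence as a log-likelihood ratio against the wild-type residue. The scoring rule is a common choice for PLM-based variant scoring and is an assumption here, since the abstract does not state the exact score used. The names `in_silico_dms`, `toy_log_probs`, and `LogProbFn` are hypothetical, and `toy_log_probs` is a stand-in for a model, not ESM-2.

```python
import math
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical scorer interface: (sequence, position) -> log-probability
# of each of the 20 amino acids at that position.
LogProbFn = Callable[[str, int], Dict[str, float]]


def in_silico_dms(sequence: str, log_probs_at: LogProbFn) -> List[Dict[str, float]]:
    """Score each single substitution as log p(mutant) - log p(wild type).

    Assumed scoring rule; negative values mean the scorer considers the
    mutant residue less likely than the wild-type residue at that site.
    """
    scores = []
    for pos, wild_type in enumerate(sequence):
        log_probs = log_probs_at(sequence, pos)
        scores.append(
            {
                aa: log_probs[aa] - log_probs[wild_type]
                for aa in AMINO_ACIDS
                if aa != wild_type
            }
        )
    return scores


def toy_log_probs(sequence: str, pos: int) -> Dict[str, float]:
    """Toy stand-in for a PLM (NOT ESM-2): favours the residue already present."""
    weights = {aa: (5.0 if aa == sequence[pos] else 1.0) for aa in AMINO_ACIDS}
    total = sum(weights.values())
    return {aa: math.log(w / total) for aa, w in weights.items()}


# Short example peptide, used only for illustration (not a full spike sequence).
dms = in_silico_dms("MFVFLVLLPL", toy_log_probs)
```

In practice, `toy_log_probs` would be replaced by a function that runs a pretrained PLM on the sequence and returns its per-position log-probabilities. Everything else stays the same, so the scan needs only the single input sequence and no alignment.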
Taxonomy
Topics: Machine Learning in Bioinformatics · SARS-CoV-2 and COVID-19 Research · vaccines and immunoinformatics approaches
