Variational auto-encoding of protein sequences
Sam Sinai, Eric Kelsic, George M. Church, Martin A. Nowak

TL;DR
This paper embeds natural protein sequences with a variational auto-encoder and uses the embedding to predict how mutations affect protein function, to cluster natural variants, and to guide exploration of protein sequence space.
Contribution
The paper presents an unsupervised variational auto-encoder approach for protein sequences that generally outperforms baseline methods that ignore interactions within sequences and, in some cases, surpasses state-of-the-art inverse-Potts approaches at predicting mutation effects.
Findings
Generally better mutation effect prediction than baseline methods that consider no interactions within sequences
In some cases outperforms state-of-the-art inverse-Potts approaches
Facilitates exploration of protein sequence space
Abstract
Proteins are responsible for the most diverse set of functions in biology. The ability to extract information from protein sequences and to predict the effects of mutations is extremely valuable in many domains of biology and medicine. However, the mapping between protein sequence and function is complex and poorly understood. Here we present an embedding of natural protein sequences using a Variational Auto-Encoder and use it to predict how mutations affect protein function. We use this unsupervised approach to cluster natural variants and learn interactions between sets of positions within a protein. This approach generally performs better than baseline methods that consider no interactions within sequences, and in some cases better than the state-of-the-art approaches that use the inverse-Potts model. This generative model can be used to computationally guide exploration of protein…
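To make the comparison in the abstract concrete, here is a minimal sketch of the kind of baseline it describes: a model that treats every alignment position independently and scores a mutant by its log-likelihood relative to the wild type. This is an illustration, not the authors' code. The alphabet, the pseudocount, the toy alignment, and all function names are assumptions introduced here.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seqs):
    """Encode equal-length aligned sequences as an (N, L, 21) array."""
    n, length = len(seqs), len(seqs[0])
    x = np.zeros((n, length, len(ALPHABET)))
    for i, seq in enumerate(seqs):
        for j, aa in enumerate(seq):
            x[i, j, INDEX[aa]] = 1.0
    return x


def site_log_probs(x, pseudocount=1.0):
    """Per-position log frequencies; no interactions between positions."""
    counts = x.sum(axis=0) + pseudocount  # shape (L, 21)
    return np.log(counts / counts.sum(axis=1, keepdims=True))


def log_likelihood(seq, log_p):
    """Sum of independent per-position log probabilities."""
    return sum(log_p[j, INDEX[aa]] for j, aa in enumerate(seq))


def mutation_score(wild_type, mutant, log_p):
    """Log-likelihood ratio of mutant to wild type (higher = more tolerated)."""
    return log_likelihood(mutant, log_p) - log_likelihood(wild_type, log_p)


# Toy alignment (hypothetical data, for illustration only).
msa = ["MKVL", "MKIL", "MRVL", "MKVL"]
log_p = site_log_probs(one_hot(msa))

score_common = mutation_score("MKVL", "MKIL", log_p)  # V->I, seen in the alignment
score_unseen = mutation_score("MKVL", "MKWL", log_p)  # V->W, never seen
```

A substitution that appears in the alignment scores higher than one that never does. A generative model such as a VAE can be evaluated with the same mutant-versus-wild-type ratio by swapping in its own likelihood estimate, which is where interactions between positions can enter.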
Taxonomy
Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics
