Importance Weighted Expectation-Maximization for Protein Sequence Design
Zhenqiao Song, Lei Li

TL;DR
This paper introduces IsEM-Pro, a latent generative model augmented with features from Markov random fields (MRFs), which generates diverse, high-fitness protein sequences and outperforms previous methods by at least 55% on average fitness score.
Contribution
The paper presents IsEM-Pro, a new approach that integrates Monte Carlo EM with MRFs for protein sequence design, enhancing diversity and fitness in generated sequences.
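To make the MRF ingredient concrete, here is a minimal sketch of how a pairwise MRF over a protein sequence can score a candidate. It assumes a Potts-style parameterization with single-site terms h and pairwise couplings J, which is a common choice for protein families; this summary does not state the exact form used in IsEM-Pro, and the names encode, potts_score, h, and J are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Map an amino-acid string to an array of integer state indices."""
    return np.array([AA_INDEX[a] for a in seq], dtype=np.int64)


def potts_score(seq_idx, h, J):
    """Unnormalized log-probability under an assumed Potts-style MRF.

    score(x) = sum_i h[i, x_i] + sum_{i<j} J[i, j, x_i, x_j]

    h has shape (L, 20); J has shape (L, L, 20, 20). Only the upper
    triangle (i < j) of J is read.
    """
    length = len(seq_idx)
    score = h[np.arange(length), seq_idx].sum()
    for i in range(length):
        for j in range(i + 1, length):
            score += J[i, j, seq_idx[i], seq_idx[j]]
    return float(score)
```

A higher score means the sequence agrees better with the single-site and pairwise preferences encoded in h and J. In practice these parameters would be fitted to data, for example a multiple sequence alignment; that fitting step is outside this sketch.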
Findings
Outperforms previous methods by at least 55% on average fitness score
Generates more diverse and novel protein sequences
Effective in eight different protein design tasks
Abstract
Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace the expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization (MCEM) method to learn the model. During inference, sampling from its latent space enhances diversity while its MRF features guide the exploration in high-fitness regions. Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best…
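The general idea of an importance-weighted Monte Carlo EM loop can be illustrated on a toy problem. The sketch below is not the IsEM-Pro model: it replaces the latent generative model with an independent per-position categorical distribution, and the function names, the temperature parameter, and the toy fitness are all assumptions made for illustration. Each iteration samples sequences from the current model (Monte Carlo E-step), assigns self-normalized importance weights proportional to exp(fitness / temperature), and refits the model by weighted maximum likelihood (M-step).

```python
import numpy as np


def sample_sequences(probs, n_samples, rng):
    """Draw sequences from independent per-position categoricals."""
    cdf = probs.cumsum(axis=1)
    u = rng.random((n_samples, probs.shape[0]))
    idx = (u[:, :, None] > cdf[None, :, :]).sum(axis=2)
    # Guard against floating-point round-off in the last CDF entry.
    return np.minimum(idx, probs.shape[1] - 1)


def importance_weighted_em(fitness, length, n_states, n_iters=30,
                           n_samples=500, temperature=0.5,
                           smoothing=1e-3, seed=0):
    """Toy importance-weighted Monte Carlo EM (illustrative only)."""
    rng = np.random.default_rng(seed)
    probs = np.full((length, n_states), 1.0 / n_states)
    for _ in range(n_iters):
        # Monte Carlo E-step: sample from the current model.
        samples = sample_sequences(probs, n_samples, rng)
        scores = np.array([fitness(s) for s in samples], dtype=float)
        # Self-normalized importance weights, shifted for stability.
        log_w = scores / temperature
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # M-step: weighted maximum likelihood with light smoothing.
        counts = np.full((length, n_states), smoothing)
        for pos in range(length):
            counts[pos] += np.bincount(samples[:, pos], weights=w,
                                       minlength=n_states)
        probs = counts / counts.sum(axis=1, keepdims=True)
    return probs


def make_match_fitness(target):
    """Hypothetical fitness: number of positions matching a target."""
    target = np.asarray(target)
    return lambda seq: float((seq == target).sum())
```

Calling importance_weighted_em(make_match_fitness([0, 1, 2, 3, 0, 1]), length=6, n_states=4) returns a (6, 4) array of per-position probabilities that concentrates on the target states. Lowering the temperature sharpens the weights toward the highest-fitness samples at the cost of sample diversity, which mirrors the fitness-versus-diversity trade-off the abstract describes.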
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Evolutionary Algorithms and Applications
