Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation
Le Zhang, Jiayang Chen, Tao Shen, Yu Li, Siqi Sun

TL;DR
This paper introduces MSA-Augmenter, a generative model that creates novel protein sequences to supplement shallow multiple sequence alignments, improving the accuracy of protein tertiary structure prediction, especially when few homologous sequences are available.
Contribution
The paper presents a new generative language model that augments shallow MSAs with synthetic sequences, improving AlphaFold2's performance on challenging protein targets.
Findings
MSA-Augmenter generates sequences that retain co-evolutionary signals.
Augmented MSAs improve structure prediction accuracy on CASP14.
Enhancement is significant for proteins with limited homologs.
Abstract
The field of protein folding research has been greatly advanced by deep learning methods, with AlphaFold2 (AF2) demonstrating exceptional performance and atomic-level precision. As co-evolution is integral to protein structure prediction, AF2's accuracy is significantly influenced by the depth of multiple sequence alignment (MSA), which requires extensive exploration of a large protein database for similar sequences. However, not all protein sequences possess abundant homologous families, and consequently, AF2's performance can degrade on such queries, at times failing to produce meaningful results. To address this, we introduce a novel generative language model, MSA-Augmenter, which leverages protein-specific attention mechanisms and large-scale MSAs to generate useful, novel protein sequences not currently found in databases. These sequences supplement shallow MSAs, enhancing the…
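To make the idea of supplementing a shallow MSA concrete, here is a minimal Python sketch. It is not the authors' pipeline: the FASTA/A3M-style parser is simplified, `min_depth` is an arbitrary illustrative threshold, and `toy_generator` is a hypothetical stand-in for a trained generative model such as MSA-Augmenter. It only shows where generated sequences would be appended before the MSA is handed to a structure predictor.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (header, aligned sequence)


def parse_fasta_like(text: str) -> List[Record]:
    """Parse FASTA/A3M-style text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def msa_depth(records: List[Record]) -> int:
    """Depth here simply means the number of sequences, query included."""
    return len(records)


def augment_if_shallow(
    records: List[Record],
    generate: Callable[[List[Record], int], List[str]],
    min_depth: int = 32,  # assumed threshold, not a value from the paper
) -> List[Record]:
    """Append generated sequences until the MSA reaches min_depth."""
    missing = max(0, min_depth - len(records))
    out = list(records)
    if missing == 0:
        return out
    seen = {seq for _, seq in records}
    for i, seq in enumerate(generate(records, missing)):
        if seq not in seen:  # skip exact duplicates of existing rows
            seen.add(seq)
            out.append((f"synthetic_{i}", seq))
    return out


def toy_generator(records: List[Record], n: int) -> List[str]:
    """Hypothetical placeholder: point-substitute 'A' into the query.

    A real generator would be a trained model conditioned on the MSA.
    """
    query = records[0][1]
    variants = []
    for i in range(n):
        pos = i % len(query)
        variants.append(query[:pos] + "A" + query[pos + 1:])
    return variants


example = """>query
MKTAYIAK
>hit_1
MKTAYLAK
"""

records = parse_fasta_like(example)
augmented = augment_if_shallow(records, toy_generator, min_depth=4)
print(msa_depth(records), "->", msa_depth(augmented))
```

The query stays in the first row and the synthetic rows keep the query's length, so the augmented alignment can be written back out in the same format as the original.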
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
