Generative power of a protein language model trained on multiple sequence alignments
Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol

TL;DR
This paper introduces an iterative sequence generation method based on the MSA Transformer model and shows that it produces high-quality protein sequences, comparable to or better than those generated by traditional Potts models, especially for small families.
Contribution
The authors develop an iterative generation approach that applies MSA Transformer's masked language modeling objective directly. It matches or outperforms Potts models in sequence quality and reproduces natural sequence statistics more accurately.
Findings
Generated sequences score as well as natural sequences on homology, coevolution, and structure-based measures.
For small families, the method surpasses Potts models in sequence quality.
The approach reproduces natural sequence statistics more accurately than Potts models do.
Abstract
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences for homology, coevolution and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones.…
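The iterative use of the masked language modeling objective described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' code: `predict_masked` is a hypothetical stand-in that picks the most common residue in a column where the real method would query MSA Transformer, and the `MASK` symbol, the `mask_prob` and `n_iters` values, and the greedy fill rule are all chosen here for illustration only.

```python
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol
MASK = "#"  # hypothetical mask symbol, chosen to lie outside ALPHABET


def predict_masked(msa, col, rng):
    """Stand-in predictor (assumption): most common unmasked residue in a column.

    A real run would read MSA Transformer's output for the masked entry instead.
    """
    counts = Counter(seq[col] for seq in msa if seq[col] != MASK)
    if not counts:  # whole column masked: fall back to a random symbol
        return rng.choice(ALPHABET)
    return counts.most_common(1)[0][0]


def iterative_generate(msa, mask_prob=0.1, n_iters=20, seed=0):
    """Repeatedly mask random MSA entries and refill them from the predictor."""
    rng = random.Random(seed)
    current = [list(seq) for seq in msa]
    n_rows, n_cols = len(current), len(current[0])
    for _ in range(n_iters):
        # 1. Mask each entry independently with probability mask_prob.
        masked = [
            (r, c)
            for r in range(n_rows)
            for c in range(n_cols)
            if rng.random() < mask_prob
        ]
        for r, c in masked:
            current[r][c] = MASK
        # 2. Predict all masked entries from the same masked MSA ...
        fills = {(r, c): predict_masked(current, c, rng) for r, c in masked}
        # 3. ... then write the predictions back (greedy fill, an assumption).
        for (r, c), residue in fills.items():
            current[r][c] = residue
    return ["".join(seq) for seq in current]


if __name__ == "__main__":
    toy_msa = ["ACDE", "ACDF", "GCDE", "ACDE"]
    for seq in iterative_generate(toy_msa, mask_prob=0.3, n_iters=5, seed=1):
        print(seq)
```

Swapping `predict_masked` for a call into a trained model is the only change the loop would need. The masking fraction and number of iterations are the main knobs, and suitable values would have to be taken from the paper rather than from this sketch.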
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Protein Structure and Dynamics
Methods: Attention Is All You Need · Linear Layer · Adam · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Layer Normalization · Softmax
