Generative power of a protein language model trained on multiple sequence alignments

Damiano Sgarbossa; Umberto Lupo; Anne-Florence Bitbol

arXiv:2204.07110 · q-bio.BM · December 30, 2024

TL;DR

This paper introduces an iterative sequence generation method based on MSA Transformer and shows that it produces protein sequences whose quality is comparable to or better than that of sequences generated by Potts models, with the clearest advantage for small families.

Contribution

The authors develop an iterative generation approach that directly uses the masked language modeling objective of MSA Transformer. The generated sequences match or exceed those of Potts models in quality and reproduce the statistics of natural sequences more accurately.

Findings

01

Generated sequences score as well as natural sequences on homology, coevolution and structure-based measures.

02

For small families, the method surpasses Potts models in sequence quality.

03

The approach reproduces the statistics of natural sequences more accurately than Potts models do.

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences for homology, coevolution and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones.…
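The iterative masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_predict` is a hypothetical stand-in that fills masked positions from column frequencies instead of calling MSA Transformer, and the defaults for `n_iters` and `p_mask`, as well as the greedy (argmax) fill, are assumptions chosen for illustration. Only the loop structure (mask some tokens, predict them, write the predictions back, repeat) reflects the method as summarized above.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap
MASK = len(ALPHABET)                # integer id reserved for the mask token


def encode(msa):
    """Turn a list of aligned strings into an integer array (rows x columns)."""
    return np.array([[ALPHABET.index(c) for c in seq] for seq in msa])


def decode(tokens):
    """Turn an integer array back into a list of aligned strings."""
    return ["".join(ALPHABET[t] for t in row) for row in tokens]


def toy_predict(tokens):
    """Hypothetical stand-in for a masked language model (NOT MSA Transformer).

    Returns probabilities of shape (rows, columns, len(ALPHABET)) computed from
    the column frequencies of the unmasked tokens, with a pseudocount of 1.
    """
    n_rows, n_cols = tokens.shape
    counts = np.ones((n_cols, len(ALPHABET)))
    for a in range(len(ALPHABET)):
        # Masked entries carry the id MASK, so they never match and are ignored.
        counts[:, a] += (tokens == a).sum(axis=0)
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.broadcast_to(probs, (n_rows, n_cols, len(ALPHABET)))


def iterative_masking(tokens, predict, n_iters=10, p_mask=0.1, seed=0):
    """Repeatedly mask random positions and refill them with model predictions."""
    rng = np.random.default_rng(seed)
    tokens = tokens.copy()  # leave the input MSA untouched
    for _ in range(n_iters):
        mask = rng.random(tokens.shape) < p_mask   # positions to hide this round
        masked = np.where(mask, MASK, tokens)      # replace them with the mask id
        probs = predict(masked)                    # per-position token probabilities
        filled = probs.argmax(axis=-1)             # greedy choice (an assumption)
        tokens = np.where(mask, filled, tokens)    # only masked positions change
    return tokens


if __name__ == "__main__":
    natural = ["ACD-", "ACDE", "AGDE"]
    generated = iterative_masking(encode(natural), toy_predict, n_iters=5, p_mask=0.3)
    print(decode(generated))
```

To experiment with a real model, `toy_predict` would be replaced by a function with the same input and output shapes that wraps the masked-token logits of the model. The official code is in the repository listed under Code & Models.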

Code & Models

Repositories

bitbol-lab/iterative_masking
PyTorch · Official

Taxonomy

Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Protein Structure and Dynamics

Methods: Attention Is All You Need · Linear Layer · Adam · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Layer Normalization · Softmax