PoET: A generative model of protein families as sequences-of-sequences

Timothy F. Truong Jr; Tristan Bepler

arXiv:2306.06156 · q-bio.QM · January 8, 2024 · 28 cites

TL;DR

PoET is a novel autoregressive transformer model that generates and scores protein sequences within families, leveraging sequences-of-sequences modeling to improve transfer learning, extrapolation, and variant prediction.

Contribution

PoET introduces a new transformer architecture that models protein families as sequences-of-sequences, enabling better transfer learning and sequence generation across diverse protein families.

Findings

01 Outperforms existing models in variant function prediction

02 Effective on small and large protein families

03 Capable of controllable protein sequence generation

Abstract

Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to…
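The sequences-of-sequences idea in the abstract can be illustrated with a toy sketch: related proteins are serialized into one token stream, and a query is scored by summing the log-probabilities of its tokens given that stream. This is not PoET's architecture or API. The delimiter tokens `START` and `STOP`, the helper names, and the `ToyContextModel` (a smoothed token-frequency model standing in for the transformer) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical delimiter tokens marking where each sequence begins and ends.
START, STOP = "$", "*"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = list(AMINO_ACIDS) + [START, STOP]


def concat_family(seqs):
    """Serialize a set of related proteins as one flat token sequence."""
    return [tok for s in seqs for tok in (START, *s, STOP)]


class ToyContextModel:
    """Stand-in for an autoregressive model (assumption, not PoET).

    Predicts the next token from add-one-smoothed token frequencies
    of everything seen so far, ignoring token order.
    """

    def __init__(self, alphabet):
        self.alphabet = alphabet

    def log_prob(self, token, context):
        counts = Counter(context)
        total = len(context) + len(self.alphabet)
        return math.log((counts[token] + 1) / total)


def score(model, context_seqs, query):
    """Sum of log p(token | family context, earlier query tokens)."""
    prefix = concat_family(context_seqs)
    total = 0.0
    for tok in (START, *query, STOP):
        total += model.log_prob(tok, prefix)
        prefix.append(tok)  # later tokens also condition on this one
    return total


model = ToyContextModel(ALPHABET)
family = ["MKVL", "MKIL"]  # made-up context sequences
in_family = score(model, family, "MKVL")
unrelated = score(model, family, "WYFH")
print(in_family > unrelated)
```

In this sketch a query that shares residues with the context receives a higher conditional log-likelihood than an unrelated one, which is the sense in which the context sequences act as retrieved conditioning information. A real model would replace `ToyContextModel` with a learned network that attends over the concatenated sequences.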

Code & Models

Repositories

OpenProteinAI/PoET
PyTorch · Official

Taxonomy

Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings