ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Ahmed Elnaggar; Michael Heinzinger; Christian Dallago; Ghalia Rihawi; Yu Wang; Llion Jones; Tom Gibbs; Tamas Feher; Christoph Angerer; Martin Steinegger; Debsindhu Bhowmik; Burkhard Rost

arXiv:2007.06225 · cs.LG · May 6, 2021 · 71 cites

Open Access · 1 Repo

TL;DR

This paper demonstrates that large-scale self-supervised language models trained on protein sequences can effectively capture biophysical features and outperform traditional methods in predicting protein structure and localization, without relying on evolutionary data.

Contribution

The study introduces a suite of protein language models trained on massive datasets using high-performance computing, achieving state-of-the-art results in structural and localization predictions without evolutionary information.

Findings

01. Protein embeddings capture biophysical features.

02. Models outperform state-of-the-art in secondary structure prediction.

03. Embeddings enable accurate localization and membrane classification.
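
The third finding relies on per-protein predictions, while a language model produces one embedding vector per residue. This page does not say how the two are bridged, so the following is only a minimal sketch that assumes per-residue vectors are averaged into one fixed-length vector per protein. The function name `mean_pool` and the toy 3-dimensional vectors are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one per-protein vector.

    Assumption: simple mean pooling; the page itself does not state
    which pooling strategy is used.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same length")
    n = len(residue_embeddings)
    # Average each dimension across all residues.
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy example: a 3-residue protein with 3-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
protein_vector = mean_pool(toy_embeddings)
print(protein_vector)  # [2.0, 2.0, 1.0]
```

The resulting vector has the same length regardless of protein length, which is what lets a standard classifier consume it for tasks such as localization or membrane classification.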

Abstract

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein…
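
The 3-state accuracy Q3 quoted in the abstract is the fraction of residues whose predicted secondary structure class matches the observed one. As a rough sketch of that arithmetic, the snippet below assumes the three states are encoded as the letters H (helix), E (strand), and C (other); that encoding, the function name `q3_accuracy`, and the toy strings are assumptions for illustration and are not taken from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy example: 10 residues, one mismatch at the fourth position.
predicted = "HHHHEEEECC"
observed = "HHHEEEEECC"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.0%}")  # Q3 = 90%
```

In practice the score is usually computed over all residues of a test set rather than a single protein, so a reported range such as 81%-87% reflects different models or test sets, not different positions.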

Code & Models

Repositories

agemagician/ProtTrans
PyTorch · Official

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Reliability and Analysis Research