ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

TL;DR
This paper demonstrates that large-scale self-supervised language models trained on protein sequences can effectively capture biophysical features and outperform traditional methods in predicting protein structure and localization, without relying on evolutionary data.
Contribution
The study introduces a suite of protein language models trained on massive datasets using high-performance computing, achieving state-of-the-art results in structural and localization predictions without evolutionary information.
Findings
Protein embeddings capture biophysical features.
Models outperform state-of-the-art in secondary structure prediction.
Embeddings enable accurate localization and membrane classification.
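To make the per-protein findings concrete, here is a minimal sketch of how per-residue embeddings could be reduced to a single fixed-size vector per protein before a localization or membrane classifier is applied. Mean pooling is an assumption for illustration (this summary does not say how the reduction is done), and the function name `mean_pool` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (L x d) into one per-protein vector (d).

    Assumption: mean pooling is used purely for illustration; it is one
    common way to obtain a length-independent protein representation.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    n = len(residue_embeddings)
    # Average each embedding dimension over all residues.
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy example: 3 residues with 2-dimensional embeddings
# (real language-model embeddings have far more dimensions).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protein_vector = mean_pool(toy)
print(protein_vector)  # [3.0, 4.0]
```

The pooled vector has the same size for every protein regardless of sequence length, which is what allows a standard classifier to consume it.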
Abstract
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein…
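The Q3 figure quoted in the abstract is a per-residue accuracy over three secondary-structure states. As a minimal sketch, assuming the usual three labels H (helix), E (strand), and C (other), it could be computed as below. The function name `q3_accuracy` and the example strings are hypothetical and are not taken from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose 3-state label is predicted correctly.

    Assumption: both strings use one label per residue from {H, E, C}.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    allowed = set("HEC")
    if (set(predicted) | set(observed)) - allowed:
        raise ValueError("labels must be one of H, E, C")
    # Count positions where prediction and observation agree.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy example: 10 residues, one helix residue mispredicted as coil.
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEC"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.0%}")  # Q3 = 90%
```

A reported range such as Q3=81%-87% would then correspond to this fraction averaged over a test set; how the averaging is done (per protein or over all residues) is not stated in this summary.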
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Reliability and Analysis Research
