ProGen2: Exploring the Boundaries of Protein Language Models
Erik Nijkamp, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani

TL;DR
ProGen2 introduces large-scale protein language models with up to 6.4 billion parameters trained on extensive protein datasets, achieving state-of-the-art results in sequence modeling, generation, and fitness prediction for protein design.
Contribution
The paper presents ProGen2, a suite of scaled-up protein language models trained on diverse datasets, demonstrating improved performance and insights into the importance of data distribution.
Findings
ProGen2 models achieve state-of-the-art performance in sequence modeling.
ProGen2 can generate novel viable protein sequences.
ProGen2 accurately predicts protein fitness without additional fine-tuning.
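One common way an autoregressive language model can score fitness without fine-tuning is to rank candidate sequences by their log-likelihood under the model. This page does not spell out the exact scoring procedure, so the sketch below is an assumption: it uses a toy smoothed bigram model as a stand-in for ProGen2, and the names `train_bigram`, `log_likelihood`, `rank_variants`, and the `START` marker are hypothetical.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START = "^"  # hypothetical start-of-sequence marker


def train_bigram(sequences, pseudocount=1.0):
    """Fit a smoothed bigram model (a toy stand-in for a real protein LM)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for aa in seq:
            counts[prev][aa] += 1.0
            prev = aa

    def prob(prev, aa):
        # Additive smoothing so unseen transitions keep non-zero probability.
        total = sum(counts[prev].values()) + pseudocount * len(AMINO_ACIDS)
        return (counts[prev][aa] + pseudocount) / total

    return prob


def log_likelihood(seq, prob):
    """Sum of log p(residue | previous residue) over the sequence."""
    ll = 0.0
    prev = START
    for aa in seq:
        ll += math.log(prob(prev, aa))
        prev = aa
    return ll


def rank_variants(variants, prob):
    """Order variants from most to least likely under the model."""
    return sorted(variants, key=lambda s: log_likelihood(s, prob), reverse=True)


if __name__ == "__main__":
    model = train_bigram(["MKVLA", "MKVLG", "MKILA"])
    for variant in rank_variants(["MKVLA", "MKWLA", "WWWWW"], model):
        print(variant, round(log_likelihood(variant, model), 3))
```

With a real model, `prob` would be replaced by the network's next-token distribution conditioned on the full prefix; the ranking logic stays the same.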
Abstract
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest…
Code & Models
- 🤗 AntibodyGeneration/fine-tuned-progen2-small (model) · 18 dl · ♡ 7
- 🤗 AntibodyGeneration/fine-tuned-progen2-large (model) · 8 dl · ♡ 7
- 🤗 hugohrban/progen2-small (model) · 16k dl · ♡ 2
- 🤗 hugohrban/progen2-large (model) · 1.8k dl · ♡ 1
- 🤗 hugohrban/progen2-medium (model) · 6.5k dl · ♡ 1
- 🤗 hugohrban/progen2-base (model) · 24k dl · ♡ 9
- 🤗 hugohrban/progen2-oas (model) · 550 dl
- 🤗 hugohrban/progen2-BFD90 (model) · 18 dl
- 🤗 hugohrban/progen2-xlarge (model) · 609 dl · ♡ 2
- 🤗 xinyuanzhu/progen2-xlarge (model) · 3 dl
Taxonomy
Topics: Machine Learning in Bioinformatics · Vaccines and immunoinformatics approaches · RNA and protein synthesis mechanisms
