RITA: a Study on Scaling Up Generative Protein Sequence Models

Daniel Hesslow; Niccolò Zanichelli; Pascal Notin; Iacopo Poli; Debora Marks

arXiv:2205.05789·q-bio.QM·July 18, 2022·59 cites

TL;DR

This paper introduces RITA, a suite of autoregressive generative models for protein sequences with up to 1.2 billion parameters, and shows that larger models perform better on next amino acid prediction, zero-shot fitness prediction, and enzyme function prediction. Such models are intended to help accelerate protein design.

Contribution

The paper presents the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain, and openly releases the RITA models for community use.

Findings

01

Larger RITA models improve next amino acid prediction accuracy.

02

Scaling enhances zero-shot fitness prediction performance.

03

Model size correlates with better enzyme function prediction.

Abstract

In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.
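To make two of the evaluations named in the abstract concrete, here is a minimal sketch of how next amino acid prediction and zero-shot fitness are commonly scored with an autoregressive model: perplexity from the per-residue log-probabilities, and a variant's log-likelihood relative to the wild type. This is an illustration under stated assumptions, not the paper's evaluation code. `ToyBigramModel` is a hypothetical stand-in for a real model (it conditions only on the previous residue), and the function names and example sequences are invented for the example.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class ToyBigramModel:
    """Hypothetical stand-in for an autoregressive protein model.

    It predicts each residue from the previous one only, using
    smoothed bigram counts. A real model would condition on the
    whole prefix.
    """

    def __init__(self, sequences, pseudocount=1.0):
        self.pseudocount = pseudocount
        self.counts = defaultdict(dict)
        for seq in sequences:
            prev = "^"  # start-of-sequence marker
            for aa in seq:
                self.counts[prev][aa] = self.counts[prev].get(aa, 0.0) + 1.0
                prev = aa

    def log_prob(self, prev, aa):
        """Log-probability of residue `aa` given the previous residue."""
        row = self.counts.get(prev, {})
        total = sum(row.values()) + self.pseudocount * len(AMINO_ACIDS)
        return math.log((row.get(aa, 0.0) + self.pseudocount) / total)


def sequence_log_likelihood(model, seq):
    """Sum of per-residue log-probabilities, read left to right."""
    prev, total = "^", 0.0
    for aa in seq:
        total += model.log_prob(prev, aa)
        prev = aa
    return total


def perplexity(model, seq):
    """Exponentiated average negative log-likelihood per residue."""
    return math.exp(-sequence_log_likelihood(model, seq) / len(seq))


def zero_shot_fitness(model, wild_type, variant):
    """Score a variant by its log-likelihood relative to the wild type.

    Higher means the model finds the variant more plausible; no
    fitness labels are used, hence "zero-shot".
    """
    return (sequence_log_likelihood(model, variant)
            - sequence_log_likelihood(model, wild_type))


if __name__ == "__main__":
    # Made-up training sequences, for illustration only.
    model = ToyBigramModel(["MKVLA", "MKVLG", "MKALA"])
    print("perplexity:", perplexity(model, "MKVLA"))
    print("variant score:", zero_shot_fitness(model, "MKVLA", "MKWLA"))
```

Lower perplexity means better next amino acid prediction, so comparing it across model sizes on held-out sequences is one way to observe a scaling trend. For fitness, the variant scores are typically rank-correlated against experimental measurements.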

Taxonomy

Topics: Machine Learning in Bioinformatics · Algorithms and Data Compression · Genomics and Phylogenetic Studies