Proteina: Scaling Flow-based Protein Structure Generative Models

Tomas Geffner; Kieran Didi; Zuobai Zhang; Danny Reidenbach; Zhonglin Cao; Jason Yim; Mario Geiger; Christian Dallago; Emine Kucukbenli; Arash Vahdat; Karsten Kreis

arXiv:2503.00710·cs.LG·March 4, 2025·3 cites


TL;DR

Proteina introduces a scalable flow-based model for protein backbone generation, leveraging hierarchical conditioning and advanced training techniques to produce diverse, long, and designable proteins with state-of-the-art performance.

Contribution

The paper presents Proteina, a large-scale flow-based protein generator with hierarchical conditioning, novel training strategies, and new metrics for evaluating protein structure generation.

Findings

01. Achieves state-of-the-art performance in de novo protein backbone design.

02. Generates diverse proteins up to 800 residues long.

03. Provides high-level control over secondary structures and fold-specific features.

Abstract

Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and…
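The classifier-free guidance mentioned in the abstract can be illustrated with a small sketch. This is a generic form of the technique under stated assumptions, not the paper's implementation: the guidance weight `w`, the convention that `w = 1` recovers the purely conditional velocity, and the helper `toy_velocity` (a hypothetical stand-in for a learned network) are all illustrative choices. The sampler uses plain Euler integration from noise at t = 0 toward data at t = 1.

```python
def guided_velocity(v_cond, v_uncond, w):
    """Blend two velocity predictions; w > 1 extrapolates past the conditional one."""
    return [(1.0 - w) * vu + w * vc for vc, vu in zip(v_cond, v_uncond)]


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a network: exact velocity of a straight path to `target`."""
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]


def sample(x0, cond_target, uncond_target, w, n_steps=100):
    """Euler-integrate the guided velocity field from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v_c = toy_velocity(x, t, cond_target)
        v_u = toy_velocity(x, t, uncond_target)
        v = guided_velocity(v_c, v_u, w)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    cond = [1.0, 2.0, 3.0]    # toy "conditional" endpoint
    uncond = [0.0, 0.0, 1.0]  # toy "unconditional" endpoint
    for w in (0.0, 1.0, 2.0):
        print(w, sample(start, cond, uncond, w))
```

In this toy setting the two velocity fields point at different fixed endpoints, so the effect of `w` is easy to see: `w = 0` follows the unconditional field, `w = 1` the conditional one, and `w > 1` pushes the sample beyond the conditional endpoint, away from the unconditional one.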

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper introduces novel metrics that address previously omitted distribution-level aspects of protein generation, which is both valuable and innovative, allowing for a more comprehensive evaluation of model performance. Additionally, the scaling of both training data and model aligns with the evolution of the field of protein generation.
2. The paper proposes an innovative $t$ sampling method that effectively captures the unique characteristics of protein data. This is also the first appli

Weaknesses

In line 119, a partial derivative seems to be mistakenly written as a total derivative, and the divergence is incorrectly labeled as a gradient. I believe the correct form of the continuity equation is $\partial p_t(\boldsymbol x_t)/\partial t=-\nabla_{\boldsymbol x_t}\cdot(p_t(\boldsymbol x_t)\boldsymbol u_t(\boldsymbol x_t))$. Additionally, the differential symbol should be set in upright type, as $\mathrm{d}$, to follow standard conventions.
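The form of the continuity equation given by the reviewer can be checked numerically on a toy case. The sketch below assumes a one-dimensional path x_t = (1 - t) x_0 + t MU with x_0 drawn from a standard normal and a single hypothetical data point `MU`; neither choice comes from the paper. Under that assumption p_t is Gaussian with mean t MU and standard deviation 1 - t, the velocity is u_t(x) = (MU - x) / (1 - t), and the residual of the equation should vanish up to finite-difference error.

```python
import math

MU = 2.0  # hypothetical data point; any real value works


def density(x, t):
    """p_t(x): Gaussian with mean t * MU and standard deviation 1 - t."""
    mean, std = t * MU, 1.0 - t
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def velocity(x, t):
    """u_t(x): velocity field that transports p_t along the straight-line path."""
    return (MU - x) / (1.0 - t)


def continuity_residual(x, t, h=1e-5):
    """Central-difference estimate of dp/dt + d(p * u)/dx; close to 0 if the equation holds."""
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2.0 * h)

    def flux(y):
        return density(y, t) * velocity(y, t)

    div_flux = (flux(x + h) - flux(x - h)) / (2.0 * h)
    return dp_dt + div_flux


if __name__ == "__main__":
    for x in (-1.0, 0.0, 0.6, 2.0):
        print(x, continuity_residual(x, 0.3))
```

In one dimension the divergence reduces to a single spatial derivative of the flux p_t u_t, which is why the gradient and divergence are easy to confuse in notation even though only the divergence form conserves probability mass.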

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is certainly well written, and I enjoyed reading it.
2. The paper makes several interesting and important explorations and observations. For example, although AF3 already observes the difference between equivariant and non-equivariant architectures, it is valuable to further explore scalability with non-equivariant transformers; the autoguidance part of generation also provides some new insights into protein structure generation; Studying protein structure generation in scale is also an

Weaknesses

1. Although the model is scaled up, it would be better to understand training in a more systematic way, e.g. through scaling laws. A trained flow matching model can in general still provide likelihoods; could Proteina also be evaluated by likelihood over protein structures? Is it possible to study scaling laws based on that?
2. The notation in Table 1 for models with different configurations is not very clear, which makes the table hard to read and analyze. I also sugge

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. Very well-written paper and very easy to follow.
2. The authors show that large-scale non-equivariant flow models also succeed on unconditional protein structure generation.

Weaknesses

1. The authors claim to significantly outperform all previous works; however, evidence supporting this assertion is not found in the experimental results table. Excluding unconditional models, there are no direct competitors, and comparisons can only be made with unconditional results. Even if the bold results are accepted as outperforming based on FPSD, FS, fJSD, and TM-score metrics, this model exhibits the lowest diversity.
2. RFdiffusion, ESM3, and Genie 2 were trained on different datasets,

Code & Models

Repositories

NVIDIA-Digital-Bio/proteina
jax · Official

Videos

Proteina: Scaling Flow-based Protein Structure Generative Models· slideslive

Taxonomy

Topics · Genetics, Bioinformatics, and Biomedical Research

Methods · Sparse Evolutionary Training