Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
Danny Reidenbach, Zhonglin Cao, Zuobai Zhang, Kieran Didi, Tomas Geffner, Guoqing Zhou, Jian Tang, Christian Dallago, Arash Vahdat, Emine Kucukbenli, Karsten Kreis

TL;DR
This paper introduces a new high-quality synthetic protein dataset and a unified framework that significantly enhances the diversity and co-designability of de novo protein design models by aligning synthetic sequences with structures.
Contribution
The authors created a novel synthetic dataset aligned with experimental favorability and structural recoverability, improving state-of-the-art protein design models' diversity and co-designability.
Findings
+54% in structural diversity for La-Proteina
+27% in co-designability for La-Proteina
+73% in structural diversity for Proteina Atomistica
Abstract
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteina, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteina…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsProtein Structure and Dynamics · Machine Learning in Materials Science · Enzyme Structure and Function
