ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning
Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu

TL;DR
ProteinZero introduces an online reinforcement learning framework for protein design that continuously improves generative models by combining structural guidance, a novel ddG predictor, and diversity regularization, outperforming existing methods.
Contribution
It presents a scalable, self-improving RL approach for protein generation that balances multiple objectives and enhances design success rates without relying on labeled datasets.
Findings
Outperforms state-of-the-art baselines on the CATH-4.3 benchmark
Reduces design failure rates by 36-48%
Achieves success rates above 90% across diverse folds
Abstract
Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence-structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self-improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self-derived ddG predictor, providing stable multi-objective signals while avoiding the prohibitive cost of physics-based methods. To ensure robustness in online RL, we further introduce a novel embedding-level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation…
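The reward pipeline described in the abstract can be illustrated with a minimal sketch. The paper's actual formulas are not given in this summary, so everything below is an assumption made for illustration: the function names, the weights `w_struct`, `w_ddg`, and `w_div`, the convention that higher scores are better for both the structural and the stability term, and the use of mean pairwise cosine similarity as the embedding-level diversity penalty.

```python
import numpy as np

def diversity_penalty(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity across a batch of sequence embeddings.

    Hypothetical stand-in for an embedding-level diversity regularizer:
    1.0 means all embeddings point the same way, 0.0 means they are orthogonal.
    """
    n = embeddings.shape[0]
    if n < 2:
        return 0.0  # a single sample has no pairs to compare
    norms = np.maximum(np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12)
    normed = embeddings / norms
    sim = normed @ normed.T
    off_diagonal_sum = sim.sum() - np.trace(sim)  # drop self-similarity
    return float(off_diagonal_sum / (n * (n - 1)))

def combined_reward(structure_score, ddg_score, embeddings,
                    w_struct=1.0, w_ddg=0.5, w_div=0.1):
    """Weighted sum of per-sample rewards minus a batch-level diversity penalty.

    Assumes both scores are oriented so that higher is better; the weights
    are illustrative defaults, not values from the paper.
    """
    per_sample = (w_struct * np.asarray(structure_score)
                  + w_ddg * np.asarray(ddg_score))
    return per_sample - w_div * diversity_penalty(embeddings)
```

In this sketch the penalty is shared by every sample in the batch, so a batch of near-identical designs is scored lower as a whole, which is one simple way to discourage mode collapse.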
Peer Reviews
Decision: Submitted to ICLR 2026
ProteinZero's primary strength lies in its innovative concept of linking protein generation with self-improvement via online Reinforcement Learning. This paradigm enables continuous model iteration, effectively overcoming the limitations of static, supervised training data. The methodology demonstrates diverse and practical attempts within the online RL framework itself, including a novel embedding-level diversity regularizer, and features a well-designed approach to multi-objective optimization
1. Computational predictors like Pred-ddG and FoldX ΔΔG inherently have limited correlation with ground-truth wet-lab experimental stability metrics. Whether used as a reward model for optimization or as a metric for evaluation, reliance on these tools introduces an unavoidable systemic bias, meaning the model may optimize for computational artifacts rather than true biophysical stability. 2. The paper's review of related work is incomplete, which complicates the assessment of the necessity and
1. Many prior protein foundation model + RL post-training works focus on offline RL, but online RL makes more sense for tasks with reliable reward functions (e.g., self-consistency). I strongly support exploring online RL. 2. Strong results in designability metrics. Experiments are thorough and well-executed. Extensive ablation on reward functions and RL algorithms (offline vs. online).
1. The thermal stability reward is interesting but a bit unclear to me in both formulation and interpretation. (a) Validation: There’s no analysis showing how well the proposed reward correlates with true ddG values. (b) Formulation: Why introduce two models $p_\theta$ and $p_\varphi$ (prior) instead of using the same model (like $\log p_\theta(y|x) - \log p_\theta(y)$ or $\log p_\varphi(y|x) - \log p_\varphi(y)$)? Having both $p_\theta$ and $p_\varphi$ seems to make the optimization more complex
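To make the quantity under discussion concrete, here is a minimal sketch of a log-likelihood-ratio score of the form $\log p(y|x) - \log p(y)$, with the conditional and prior log-probabilities supplied by whichever models are chosen. The exact formulation used in the paper is not reproduced in this review, so the function names and the per-position log-probability array layout are assumptions for illustration only.

```python
import numpy as np

def sequence_log_prob(log_probs: np.ndarray, token_ids: np.ndarray) -> float:
    """Sum the per-position log-probabilities of the observed residues.

    log_probs has shape (sequence_length, vocabulary_size);
    token_ids has shape (sequence_length,).
    """
    positions = np.arange(len(token_ids))
    return float(log_probs[positions, token_ids].sum())

def log_ratio_score(cond_log_probs: np.ndarray,
                    prior_log_probs: np.ndarray,
                    token_ids: np.ndarray) -> float:
    """Hypothetical stability proxy: log p(y|x) - log p(y).

    cond_log_probs come from a structure-conditioned model and
    prior_log_probs from a sequence-only prior; whether these are one
    model or two is exactly the design choice questioned above.
    """
    return (sequence_log_prob(cond_log_probs, token_ids)
            - sequence_log_prob(prior_log_probs, token_ids))
```

A positive score means the structure-conditioned model finds the sequence more likely than the prior does. Whether that tracks experimental ddG is the validation question raised in point (a).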
**Originality**: I like the idea of an online RL framework for this problem. Furthermore, the authors incorporate multiple novel elements into the architecture, such as the ddG predictor and the regularizer. **Quality**: The results in the experiment section are very impressive and carefully checked through extra steps like the AlphaFold3 validation. **Clarity**: The paper is very well explained. The text is well organized and clear. **Significance**: It is important to improve inverse folding,
- The structural designability reward relies solely on ESMFold. Although this is briefly discussed in the paper and external validation is provided, over-optimization toward the biases or error modes of a single folding model remains a subtle risk. It would be interesting to add multiple oracles to prevent such biases. - While the model is very interesting, the limitation of being used only for monomers does impact its usefulness for drug design. - There is a growing trend in the field towar
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Materials Science · Vaccines and Immunoinformatics Approaches
