Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design

Xingyu Su; Xiner Li; Masatoshi Uehara; Sunwoo Kim; Yulai Zhao; Gabriele Scalia; Ehsan Hajiramezanali; Tommaso Biancalani; Degui Zhi; Shuiwang Ji

arXiv:2507.00445·cs.LG·March 3, 2026

3 Reviews

TL;DR

This paper introduces an iterative distillation framework for fine-tuning diffusion models to optimize arbitrary reward functions in biomolecular design, improving stability and efficiency over traditional RL methods.

Contribution

It proposes a novel off-policy distillation approach for reward-guided diffusion model fine-tuning, addressing stability and sample efficiency issues.

Findings

1. Outperforms RL-based methods in reward optimization
2. Demonstrates effectiveness across protein, small molecule, and DNA design
3. Enhances training stability and sample efficiency

Abstract

We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. **Principled Distillation Framework:** The paper introduces a principled distillation-based alternative to PPO-style fine-tuning, effectively addressing instability issues commonly seen in on-policy reinforcement learning for diffusion models.
2. **Handling Non-differentiable Rewards:** The proposed method applies to realistic biomolecular tasks where reward gradients are unavailable, bridging diffusion modeling with practical scientific design scenarios.
3. **Stability via Lazy Updates:**

Weaknesses

1. **Heuristic Value Approximation:** The soft-value approximation via a single forward reward prediction (Algorithm 1, line 8) may introduce bias or variance, but the paper lacks quantitative analysis of its impact.
2. **Limited Theoretical Rigor:** The forward-KL interpretation is intuitive yet only algebraically shown (Appendix B) without a formal derivation or stability proof.
3. **Comparisons on Differentiable Rewards:** For differentiable cases (e.g., DNA enhancer tasks), comparisons with

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The motivation and insight behind the method design make sense, and the proposed solutions are simple and easy to implement.
2. The paper covers many biological systems, demonstrating the effectiveness and robustness of the algorithm across different tasks.
3. VIDD unifies diffusion and value-weighted MLE with a clear objective function and implementation; the framework can incorporate any type of non-differentiable reward function, which gives it a wide range of applications in biomolecular design.

Weaknesses

1. The most concerning point for me is that the paper does not demonstrate the effectiveness of each component through ablation studies, which weakens the Method section. More results on different biological systems are needed.
2. Only a limited set of baselines is discussed and compared. Some baselines from discrete flow matching/diffusion for biological sequence design are highly relevant to this work [1-4].
3. More rigorous theoretical proofs and analysis are needed (see Questions).
4. Th

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Effective Adaptation to Non-Differentiable Rewards: VIDD bypasses the need for reward gradients by approximating soft value functions using the diffusion model’s x₀ prediction. This allows it to directly optimize core non-differentiable rewards in biomolecular design, such as binding affinity scores from AlphaFold and docking scores from QuickVina2, filling a key gap in fine-tuning diffusion models for scientific applications.
2. Training Stability and Anti-Collapse Ability: Unlike on-policy R

Weaknesses

1. Unaddressed Impact of Value Function Approximation Errors: VIDD relies on the diffusion model’s x₀ prediction to approximate soft value functions, but the paper does not analyze how errors in the x₀ prediction, such as those arising from long protein sequences or complex DNA structures, affect the distillation process. It also does not compare this approximation with alternatives such as Monte Carlo sampling, leaving it unclear whether this is the best approach.
2. High Hyperparamete
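
The approximation question raised by Reviewers 01 and 03 can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a 1-D Gaussian data distribution (so the posterior over x₀ given x_t is known in closed form), a hypothetical quadratic reward, and the entropy-regularized soft value α·log E[exp(r(x₀)/α) | x_t], which is an assumed definition since the page does not spell it out. It compares a one-step estimate that evaluates the reward at the predicted x₀ against a Monte Carlo estimate of the same quantity.

```python
import numpy as np


def reward(x):
    # Hypothetical reward for illustration only, peaked at x = 1.
    return -(x - 1.0) ** 2


def posterior_params(x_t, a_bar):
    # Toy assumption: x0 ~ N(0, 1) and x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
    # Then x0 | x_t is Gaussian with mean sqrt(a_bar) * x_t and variance 1 - a_bar.
    return np.sqrt(a_bar) * x_t, 1.0 - a_bar


def value_one_step(x_t, a_bar):
    # One-step estimate: evaluate the reward at the posterior-mean prediction of x0.
    mean, _ = posterior_params(x_t, a_bar)
    return float(reward(mean))


def value_monte_carlo(x_t, a_bar, alpha=1.0, n=100_000, seed=0):
    # Monte Carlo estimate of alpha * log E[exp(r(x0) / alpha) | x_t].
    rng = np.random.default_rng(seed)
    mean, var = posterior_params(x_t, a_bar)
    x0 = mean + np.sqrt(var) * rng.standard_normal(n)
    z = reward(x0) / alpha
    m = z.max()  # subtract the max before exponentiating for numerical stability
    return float(alpha * (m + np.log(np.mean(np.exp(z - m)))))


if __name__ == "__main__":
    for a_bar in (0.9999, 0.5):
        gap = abs(value_one_step(0.5, a_bar) - value_monte_carlo(0.5, a_bar))
        print(f"a_bar={a_bar}: |one-step - Monte Carlo| = {gap:.4f}")
```

In this toy setting the two estimates nearly coincide at low noise (a_bar close to 1) and drift apart as the noise level grows, which is the kind of quantitative analysis the reviewers ask for. Whether the same pattern holds for real protein, molecule, or DNA models is not something this sketch can show.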
