Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Haotian Ye, Huayu Chen, Yisong Yue, Stefano Ermon

TL;DR
This paper introduces a stepwise decomposition method for aligning discrete diffusion models with reward functions, improving efficiency and performance across sequence modeling tasks such as DNA sequence design, protein inverse folding, and language modeling.
Contribution
The paper proposes an offline preference optimization framework that decomposes trajectory alignment into stepwise objectives by matching the per-step posterior. The framework is compatible with arbitrary reward functions.
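For orientation, the reward-tilted target that preference-optimization methods of this kind usually aim at can be written down. The block below is standard background on KL-regularized alignment, not an equation quoted from the paper. The temperature $\beta$, the reference model $p_{\mathrm{ref}}$, and the per-step terms $\hat{r}_t$ are notation assumed here for illustration only.

```latex
\begin{align}
% KL-regularized reward maximization (standard background, assumed notation)
\max_{p}\;& \mathbb{E}_{x_0 \sim p(\cdot \mid c)}\big[ r(x_0, c) \big]
  - \beta\, \mathrm{KL}\big( p(\cdot \mid c) \,\|\, p_{\mathrm{ref}}(\cdot \mid c) \big) \\
% Its optimum is the reference distribution tilted by the exponentiated reward
p^{*}(x_0 \mid c) \;&\propto\; p_{\mathrm{ref}}(x_0 \mid c)\, \exp\!\big( r(x_0, c) / \beta \big) \\
% One way to read "additive factorization of the trajectory reward":
% a sum of hypothetical per-step terms \hat{r}_t
\hat{r}(x_{0:T}, c) \;&=\; \sum_{t=1}^{T} \hat{r}_t
\end{align}
```

Under a factorization of this shape, each step can in principle be aligned against its own term, which is the intuition behind replacing one trajectory-level objective with a set of stepwise ones.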
Findings
Up to 12% improvement in predicted activity on DNA sequence design.
Enhanced GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct.
Demonstrated superiority across multiple sequence modeling domains.
Abstract
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains…
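A minimal sketch of what a single stepwise alignment term could look like, assuming a listwise, DPO-style form: for one noisy state $x_t$, draw $N$ clean candidates $x_0$, score them with the reward, and match the model's implicit distribution over those candidates to the softmax of the rewards. This is an illustration under stated assumptions, not the paper's exact objective. The function names, the temperature `beta`, and the toy shapes are all hypothetical. The posterior is treated as factorized over positions, as is common for discrete diffusion denoisers.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def factorized_log_prob(logits, x0):
    """log p(x0 | x_t) under a posterior that factorizes over positions.

    logits: (L, V) per-position denoiser outputs (hypothetical stand-in for f_theta(x_t, t)).
    x0:     (L,) integer token ids of a clean candidate.
    """
    lp = log_softmax(logits, axis=-1)
    return lp[np.arange(x0.shape[0]), x0].sum()

def stepwise_alignment_loss(logp_theta, logp_ref, rewards, beta=1.0):
    """Cross-entropy between the reward-softmax target and the model's
    implicit distribution over N candidates (assumed listwise form).

    logp_theta, logp_ref: (N,) log p(x0_i | x_t) under the trained and reference models.
    rewards:              (N,) reward of each clean candidate, r(x0_i, c).
    """
    target = np.exp(log_softmax(rewards / beta))        # softmax(r / beta)
    model_logq = log_softmax(logp_theta - logp_ref)     # softmax of log-ratios, in log space
    return -(target * model_logq).sum()

# Toy usage with random stand-ins for model outputs (L=4 positions, V=5 tokens, N=3 candidates).
rng = np.random.default_rng(0)
L, V, N = 4, 5, 3
candidates = rng.integers(0, V, size=(N, L))
theta_logits = rng.normal(size=(L, V))
ref_logits = rng.normal(size=(L, V))
logp_theta = np.array([factorized_log_prob(theta_logits, x) for x in candidates])
logp_ref = np.array([factorized_log_prob(ref_logits, x) for x in candidates])
rewards = rng.normal(size=N)
loss = stepwise_alignment_loss(logp_theta, logp_ref, rewards, beta=0.5)
```

The loss is minimized when the log-ratio between the trained and reference posteriors equals `rewards / beta` up to an additive constant, which is the per-step analogue of a reward-tilted target. Only rewards of clean samples are used, so no gradient has to flow through the whole denoising chain.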
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well structured and clearly written. The overall presentation flows smoothly and is easy to follow. Most details are clear. 2. The experiments cover diverse relevant domains and are comprehensively evaluated. Beyond achieving the best rewards among all compared methods, the sequences generated by the proposed method also maintain comparable naturalness (close to natural sequences, or low NLL) and diversity (entropy).
1. Missing parts in the related work section. Although the authors ran some inference-time guidance methods in the DNA and protein experiments, I think it is still worth mentioning such baselines in the related work on diffusion alignment. The pros and cons of these two different alignment paradigms, i.e., adding guidance during the inference time without training diffusion models like TDS or CG, or fine-tuning diffusion model through preference optimization like SDPO, co
1. Side-stepping the trajectory-level nature of discrete diffusion fine-tuning is an important problem, and the proposed method is a useful step in this direction. 2. The work is well organized. 3. The preference alignment task in language modelling demonstrates scalability of the method to larger models. 4. The extension to iterative training is useful. 5. The ablations section is well-done, and clarifies important aspects of the method (namely its sensitivity to the sample size $N$). 6. The e
1. The method assumes access to $p_\theta(x_0|x_t)$ - however, in discrete diffusion, we only have access to a factorized approximation (or equivalently, the **mean**) (e.g., see Section 3 of (Shi et al., 2024)) of this posterior (through $f_\theta(x_t,t)$). Equation 12 conflates these two things as being equal. - This additionally calls into question the applicability of the theoretical analysis in Theorem 4.1 - does the result still hold for the factorized approximation of the posterior $p_
- Experiments on different domains demonstrate strong performance across different metrics. - The idea to leverage the intermediate step information for alignment of discrete diffusion models is well-motivated.
- Although claimed as a trajectory alignment algorithm to maximize the stepwise reward, the algorithm does not really use the stepwise reward. Instead, it only uses the reward of the clean sample $r(x_0,c)$. Actually, the trajectory alignment here is more like a simple decomposition of the joint distribution for the data itself, while irrelevant to the reward-tilted part. - In Theorem 4.1, the optimality of the trajectory alignment objective is achieved when the reward $\hat{r}(x_{0:T},c)$ is
Taxonomy
Topics: Advanced Numerical Analysis Techniques
