Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
Saba Ahmadi, Prasanna Parthasarathi, Yufei Cui

TL;DR
This paper introduces TraFL, a post-training method for diffusion language models that counters trajectory locking (the over-concentration of probability mass on a narrow set of denoising paths) and improves performance on mathematical reasoning and code generation benchmarks.
Contribution
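TraFL (Trajectory Flow baLancing) is a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model, with the aim of keeping diverse correct solution paths reachable while improving benchmark performance.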
Findings
TraFL outperforms the base model on all evaluated benchmarks.
TraFL maintains its improvements as the sampling budget increases.
TraFL surpasses the base model on Minerva Math and LiveCodeBench evaluations.
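The finding about sampling budget is easier to read with a concrete metric in mind. This summary does not say which metric the paper uses; the sketch below assumes pass@k, the usual coverage measure for repeated sampling, and computes its standard unbiased estimator. The function name `pass_at_k` and its arguments are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate (assumed metric, not confirmed by the summary).

    n: total samples drawn for one problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this assumption, a model that concentrates on a few denoising paths can score well at k = 1 yet gain little as k grows, which is the pattern the trajectory-locking claim describes.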
Abstract
Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the…
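To make the idea of a reward-tilted target concrete, here is a minimal sketch of a generic trajectory-balance residual on a toy discrete problem. It assumes the common form in which the target is proportional to the reference probability times exp(reward / beta), and the loss is the squared gap between log Z + log policy and log reference + reward / beta. The abstract is truncated and does not give the paper's exact loss, the diffusion-compatible sequence-level surrogate, or how the prompt-dependent normalization is learned, so the names `reward_tilted_target`, `trajectory_balance_loss`, and `beta` are hypothetical and this should not be read as the authors' implementation.

```python
import math


def reward_tilted_target(ref_probs, rewards, beta=1.0):
    """Return (target_probs, log_z) for target ∝ ref * exp(reward / beta).

    Assumed form of the reward-tilted distribution; beta is a hypothetical
    temperature, not a parameter named in the source.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)


def trajectory_balance_loss(log_z, logp_policy, logp_ref, rewards, beta=1.0):
    """Mean squared trajectory-balance residual over a batch of samples.

    In a real system log_z would be a learned, prompt-dependent scalar and the
    log-probabilities would come from sequence-level surrogates; here they are
    plain numbers for illustration.
    """
    residuals = [
        log_z + lp - lr - r / beta
        for lp, lr, r in zip(logp_policy, logp_ref, rewards)
    ]
    return sum(d * d for d in residuals) / len(residuals)


# Toy example: three candidate answers for one prompt, the first is rewarded.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 0.0]
target, log_z = reward_tilted_target(ref, rewards)
log_ref = [math.log(p) for p in ref]

# Loss when the policy already matches the tilted target, and when it still
# equals the untrained reference model.
loss_at_target = trajectory_balance_loss(log_z, [math.log(p) for p in target], log_ref, rewards)
loss_at_reference = trajectory_balance_loss(log_z, log_ref, log_ref, rewards)
```

In this toy setup the residual vanishes only when the policy equals the tilted target, so unrewarded candidates keep a share of probability mass proportional to their reference probability instead of being driven to zero, which is the contrast with pure reward maximization that the abstract draws.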
