Aligning Diffusion Language Models via Unpaired Preference Optimization
Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang

TL;DR
This paper introduces ELBO-KTO, a method for aligning diffusion language models to human preferences from unpaired data. It combines an ELBO surrogate for the intractable sequence log-likelihood with a prospect-theoretic preference objective (KTO), and it is evaluated on preference benchmarks and downstream reasoning/knowledge tasks.
Contribution
The paper presents ELBO-KTO, an approach that enables unpaired preference optimization for diffusion language models. It addresses two obstacles: sequence log-likelihoods are intractable for these models, and pairwise preference data are costly to collect.
Findings
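Applied to LLaDA-8B-Instruct, ELBO-KTO reaches adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary against the base model, as scored by an automatic LLM judge.
On downstream tasks including GSM8K and MMLU, the aligned model performs comparably to or better than the base model.
Variance-reduction practices stabilize gradients during training.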
Abstract
Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on…
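To make the combination described in the abstract concrete, the following is a minimal sketch of a per-example loss, assuming the standard KTO form with ELBO estimates substituted for the exact sequence log-likelihoods. The function name `elbo_kto_loss`, its parameters (`beta`, `z0`, `lambda_d`, `lambda_u`), and their default values are illustrative assumptions, not the authors' implementation; the abstract does not give the exact objective or the variance-reduction details.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def elbo_kto_loss(elbo_policy, elbo_ref, desirable,
                  beta=0.1, z0=0.0, lambda_d=1.0, lambda_u=1.0):
    """Hypothetical per-example KTO-style loss with ELBO surrogates.

    elbo_policy, elbo_ref: ELBO estimates (lower bounds on log-likelihood)
        of the same response under the trained model and a frozen reference.
    desirable: True if the response is labeled desirable, False otherwise.
    z0: reference point (in KTO, a non-negative estimate of the
        policy-to-reference divergence); assumed to be given here.
    """
    # Implied reward: difference of ELBOs stands in for the log-ratio
    # log pi_theta(y|x) - log pi_ref(y|x), which is intractable for dLLMs.
    reward = elbo_policy - elbo_ref
    z0 = max(z0, 0.0)  # the reference point is clamped at zero
    if desirable:
        value = lambda_d * sigmoid(beta * (reward - z0))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (z0 - reward))
    return lambda_u - value


if __name__ == "__main__":
    # A desirable response whose ELBO rose relative to the reference
    # should incur a lower loss than one whose ELBO is unchanged.
    print(elbo_kto_loss(-5.0, -10.0, desirable=True))
    print(elbo_kto_loss(-10.0, -10.0, desirable=True))
```

Because each example carries only a binary desirable/undesirable label, no chosen/rejected pair is needed, which is the sense in which the objective is unpaired. In practice the ELBO inputs would be Monte Carlo estimates, so their noise propagates into the loss; this is the bias and variance issue the abstract refers to.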
