LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li

TL;DR
LLaDA 1.5 introduces a variance-reduction framework for preference optimization in large language diffusion models, significantly improving alignment and performance across multiple benchmarks.
Contribution
The paper presents VRPO, a theoretical and practical framework for reducing variance in ELBO-based preference optimization, enhancing model alignment.
Findings
LLaDA 1.5 outperforms its predecessor on multiple benchmarks.
VRPO significantly improves alignment quality.
LLaDA 1.5 shows competitive mathematical reasoning performance.
Abstract
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to…
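The two variance-reduction ideas named in the abstract can be illustrated with a toy Monte Carlo experiment. The sketch below is not the paper's estimator: `toy_loss`, its coefficients, the budget of 8, and the 0.9 "reference" scale are all hypothetical stand-ins chosen for illustration. It also assumes that "antithetic sampling" here means reusing the same random draws for the two estimates whose difference is needed (often called common random numbers), which the summary above does not spell out.

```python
import random
import statistics

def toy_loss(t, eps, scale=1.0):
    # Hypothetical stand-in for one per-sample ELBO term; NOT the paper's loss.
    return scale * (3.0 * t + 0.5 * eps)

def draw_samples(rng, n_t, n_y):
    # n_t outer "timestep" draws, each paired with n_y inner "masking noise" draws.
    samples = []
    for _ in range(n_t):
        t = rng.random()
        for _ in range(n_y):
            samples.append((t, rng.gauss(0.0, 1.0)))
    return samples

def estimate(samples, scale=1.0):
    # Plain Monte Carlo average of the toy loss over the given draws.
    return sum(toy_loss(t, eps, scale) for t, eps in samples) / len(samples)

def variance_over_trials(fn, trials=2000, seed=0):
    # Empirical variance of an estimator across repeated independent trials.
    rng = random.Random(seed)
    return statistics.pvariance([fn(rng) for _ in range(trials)])

BUDGET = 8  # total number of loss evaluations per estimate (assumed value)

# Budget allocation: same total budget, split two different ways.
var_many_t = variance_over_trials(lambda rng: estimate(draw_samples(rng, BUDGET, 1)))
var_one_t = variance_over_trials(lambda rng: estimate(draw_samples(rng, 1, BUDGET)))

def diff_independent(rng):
    # "Policy" and "reference" estimates use separate random draws.
    policy = estimate(draw_samples(rng, BUDGET, 1), scale=1.0)
    reference = estimate(draw_samples(rng, BUDGET, 1), scale=0.9)
    return policy - reference

def diff_shared(rng):
    # Both estimates reuse the same draws, so their noise largely cancels.
    samples = draw_samples(rng, BUDGET, 1)
    return estimate(samples, scale=1.0) - estimate(samples, scale=0.9)

var_diff_independent = variance_over_trials(diff_independent)
var_diff_shared = variance_over_trials(diff_shared)

print("many timesteps vs. one timestep:", var_many_t, var_one_t)
print("independent vs. shared draws:", var_diff_independent, var_diff_shared)
```

In this toy, the variance of the nested average is the outer-draw variance divided by `n_t` plus the inner-draw variance divided by `n_t * n_y`, so for a fixed total budget it is smallest when every evaluation gets its own outer draw. Sharing draws helps the difference estimate because the two toy losses are strongly correlated, and how much it would help in a real model depends on how correlated the two actual estimates are.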
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper studies a timely topic: how to align diffusion LLMs with preference data. While the writing is not particularly polished, the paper remains understandable overall. The experiments are well designed and show the efficacy of the proposal.
The proposed ideas in VRPO (i.e., optimal allocation + antithetic sampling) are not too surprising. (I think that the first item, using the sampling budget to increase the number of samples $n$, is too trivial and obvious to be credited to the authors.) But since the authors theoretically and empirically demonstrate the effects of the techniques, this incremental contribution is somewhat justifiable. Beyond the novelty, I have a few concerns about the framing and structure of the paper. - The foremost one is h
The paper has a complete structure and is generally well-written.
1. The paper’s primary contribution lies in theoretically identifying the high-variance issue within the ELBO estimation as a key factor causing DPO’s instability in masked diffusion models (MDMs), and in proposing the VRPO framework to mitigate this variance. However, the main innovation resides in the problem formalization and attribution analysis rather than in the algorithmic design itself. The proposed variance-reduction techniques, while theoretically sound, are based on established statis
- The paper points out that ELBO-based DPO alignment introduces bias and variance coupling that degrades optimization stability.
- Section 4.2 presents an ablation study of each of the three components.
- Empirical results support the claim that the proposed adjustments improve training dynamics.
- The method is largely a combination of well-known variance reduction techniques, so its methodological novelty is limited.
- Increasing the number of samples adds computational overhead. The paper acknowledges this overhead but does not convincingly show superiority under equal resource conditions.
- The ablation study is limited. The individual contributions of each component are not clearly quantified.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Diffusion
