LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu; Rongzhen Wang; Shen Nie; Xiaolu Zhang; Chunwei Wu; Jun Hu; Jun Zhou; Jianfei Chen; Yankai Lin; Ji-Rong Wen; Chongxuan Li

arXiv:2505.19223·cs.LG·October 14, 2025

Open Access · 1 Model · 3 Reviews

TL;DR

LLaDA 1.5 introduces a variance-reduction framework for preference optimization in large language diffusion models, significantly improving alignment and performance across multiple benchmarks.

Contribution

The paper presents VRPO, a theoretical and practical framework for reducing variance in ELBO-based preference optimization, enhancing model alignment.
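
To make the variance issue concrete, here is a minimal toy sketch, not the paper's estimator. It assumes the ELBO is estimated by nested Monte Carlo: an outer draw of a timestep and an inner draw of masking noise at that timestep. The loss function, the noise scale, and the names `nested_estimate` and `estimator_variance` are all hypothetical. The sketch only shows that, for a fixed sample budget, how the budget is split between the two levels changes the estimator's variance; this page does not state which split the paper recommends.

```python
import random
import statistics


def nested_estimate(rng, n_t, n_mask):
    """Average a toy loss over n_t timesteps with n_mask noisy draws each.

    The toy loss (1 + t)^2 + noise is a hypothetical stand-in for a
    per-sample ELBO term; it is not taken from the paper.
    """
    total = 0.0
    for _outer in range(n_t):
        t = rng.random()  # toy "timestep" drawn uniformly from [0, 1)
        for _inner in range(n_mask):
            noise = rng.gauss(0.0, 0.5)  # toy stand-in for masking randomness
            total += (1.0 + t) ** 2 + noise
    return total / (n_t * n_mask)


def estimator_variance(n_t, n_mask, trials=4000, seed=0):
    """Empirical variance of the nested estimator over repeated trials."""
    rng = random.Random(seed)
    samples = [nested_estimate(rng, n_t, n_mask) for _ in range(trials)]
    return statistics.pvariance(samples)


budget = 8  # total number of loss evaluations per estimate
var_many_timesteps = estimator_variance(n_t=budget, n_mask=1)
var_one_timestep = estimator_variance(n_t=1, n_mask=budget)

print("variance, 8 timesteps x 1 draw :", var_many_timesteps)
print("variance, 1 timestep x 8 draws :", var_one_timestep)
```

In this toy setting, both splits have the same expectation, but spreading the budget across timesteps averages out the timestep-to-timestep variation as well as the inner noise, so its variance is lower.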

Findings

01. LLaDA 1.5 outperforms its predecessor on multiple benchmarks.

02. VRPO significantly improves alignment quality.

03. LLaDA 1.5 shows competitive mathematical reasoning performance.

Abstract

While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to…
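
The abstract names antithetic sampling without defining it. The toy sketch below assumes one plausible reading: when two Monte Carlo estimates are subtracted (for example, a policy estimate and a reference estimate), both reuse the same random draws instead of drawing independently. That reading, the toy loss, the scale factors 1.1 and 1.0, and all function names are assumptions for illustration, not the paper's construction.

```python
import random
import statistics


def toy_loss(t, scale):
    """Hypothetical per-sample loss; `scale` mimics two slightly different models."""
    return scale * (1.0 + t) ** 2


def estimate(ts, scale):
    """Monte Carlo average of the toy loss over the given draws."""
    return sum(toy_loss(t, scale) for t in ts) / len(ts)


def diff_independent(rng, n):
    """Difference of two estimates, each using its own random draws."""
    ts_policy = [rng.random() for _ in range(n)]
    ts_ref = [rng.random() for _ in range(n)]
    return estimate(ts_policy, 1.1) - estimate(ts_ref, 1.0)


def diff_shared(rng, n):
    """Difference of two estimates that reuse the same random draws."""
    ts = [rng.random() for _ in range(n)]
    return estimate(ts, 1.1) - estimate(ts, 1.0)


def variance_of(fn, n, trials=2000, seed=0):
    """Empirical variance of a difference estimator over repeated trials."""
    rng = random.Random(seed)
    return statistics.pvariance([fn(rng, n) for _ in range(trials)])


var_independent = variance_of(diff_independent, n=8)
var_shared = variance_of(diff_shared, n=8)

print("variance with independent draws:", var_independent)
print("variance with shared draws     :", var_shared)
```

Both difference estimators have the same expectation in this toy, so sharing draws does not introduce bias here; it only makes the noise in the two terms largely cancel when they are subtracted.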

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

This paper studies a timely topic: how to align diffusion LLMs with preference data. While the writing is not particularly polished, the paper remains understandable overall. The experiments are well designed and show the efficacy of the proposal.

Weaknesses

The proposed ideas in VRPO (i.e., optimal allocation + antithetic sampling) are not too surprising. (I think that the first item, spending the sampling budget to increase the number of samples $n$, is too trivial and obvious to be credited to the authors.) But since the authors theoretically and empirically demonstrate the effects of the techniques, this incremental contribution is somewhat justifiable. Beyond the novelty, I have a few concerns about the framing and structure of the paper.
- The foremost one is h

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper has a complete structure and is generally well-written.

Weaknesses

1. The paper’s primary contribution lies in theoretically identifying the high-variance issue within the ELBO estimation as a key factor causing DPO’s instability in masked diffusion models (MDMs), and in proposing the VRPO framework to mitigate this variance. However, the main innovation resides in the problem formalization and attribution analysis rather than in the algorithmic design itself. The proposed variance-reduction techniques, while theoretically sound, are based on established statis

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The paper points out that ELBO-based DPO alignment introduces bias and variance coupling that degrades optimization stability.
- Section 4.2 presents an ablation study of each of the three components.
- Empirical results support the claim that the proposed adjustments improve training dynamics.

Weaknesses

- The method is largely a combination of well-known variance reduction techniques, so its methodological novelty is limited.
- The increased sampling adds computational overhead. The paper admits this overhead yet does not convincingly show superiority under equal resource conditions.
- The ablation study is limited. The individual contributions of each component are not clearly quantified.

Code & Models

Models

🤗
GSAI-ML/LLaDA-1.5
model· 11k dl· ♡ 40

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Diffusion