LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Chenxing Wei; Jiazhen Kang; Hong Wang; Jianqing Zhang; Hao Jiang; Xiaolong Xu; Ningyuan Sun; Ying He; F. Richard Yu; Yao Shu; Bo Jiang

arXiv:2603.01563·cs.LG·March 3, 2026

Open Access

TL;DR

LFPO is a likelihood-free optimization framework for masked diffusion models. It directly optimizes denoising logits, which improves accuracy on reasoning and code generation benchmarks and reduces the number of diffusion steps needed at inference.

Contribution

LFPO is a likelihood-free policy optimization method that avoids the errors introduced by likelihood approximation in diffusion models, yielding more precise gradient estimates and faster inference.

Findings

01. Outperforms state-of-the-art baselines on reasoning and code benchmarks.

02. Reduces inference diffusion steps by approximately 20%.

03. Achieves more precise gradient estimation through geometric velocity rectification.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimation.…
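
To make the idea of a contrastive update on denoising logits more concrete, here is a minimal sketch in plain Python. It is not the LFPO objective, which this abstract does not spell out. It assumes a single masked position, a handful of sampled tokens each scored by a verifiable reward, and a simple rule that raises the logit of tokens with above-average reward and lowers the logit of tokens with below-average reward. The function names `softmax` and `rectify_logits`, the mean-reward baseline, and the learning rate are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rectify_logits(logits, samples, lr=0.5):
    """One contrastive step on the logits of a single masked position.

    samples is a list of (token_id, reward) pairs. This is an assumed,
    simplified update rule, not the loss from the paper.
    """
    baseline = sum(reward for _, reward in samples) / len(samples)
    probs = softmax(logits)  # predicted token distribution before the update
    updated = list(logits)
    for token, reward in samples:
        advantage = reward - baseline  # > 0 for good samples, < 0 for bad ones
        for i, p in enumerate(probs):
            one_hot = 1.0 if i == token else 0.0
            # Gradient of advantage * cross_entropy(logits, token) w.r.t. logit i.
            grad = advantage * (p - one_hot)
            updated[i] -= lr * grad / len(samples)
    return updated


if __name__ == "__main__":
    # Token 0 was verified correct (reward 1), token 1 was wrong (reward 0).
    print(rectify_logits([0.0, 0.0, 0.0], [(0, 1.0), (1, 0.0)]))
```

In this toy setting the rewarded token's logit moves up, the penalized token's logit moves down, and the untouched token stays where it was. No sequence-level likelihood is ever computed; the update acts only on the per-position logits.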

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning