DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

Amin Karimi Monsefi; Dominic Culver; Nikhil Bhendawade; Lokesh Boominathan; Manuel R. Ciosici; Yizhe Zhang; Irina Belousova

arXiv:2605.16342·cs.LG·May 19, 2026

TL;DR

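This paper introduces DACA-GRPO, a method that improves reinforcement learning for diffusion language models by adding temporal credit assignment across denoising steps and reducing the bias of the likelihood estimates used for policy optimization. Reported gains reach up to 5.6 percentage points on math reasoning and 7.4 percentage points on code generation.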

Contribution

The paper proposes DACA-GRPO, a lightweight, plug-and-play enhancement that improves policy optimization in diffusion language models by incorporating denoising progress scores and stratified likelihood estimation.

Findings

01  Improves math reasoning by up to 5.6 percentage points.

02  Improves code generation performance by 7.4 percentage points.

03  Significantly enhances constraint satisfaction and JSON schema adherence.

Abstract

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the…
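
The abstract is cut off before it finishes describing Stratified Masking Likelihood, so the following is only a rough sketch of the one idea that is stated: partitioning token positions into strata. It assumes, hypothetically, that strata are disjoint, roughly equal in size, and assigned at random, and that each stratum is masked in its own pass while the remaining positions stay visible. The function name `stratified_masks` and its parameters are illustrative and not taken from the paper; how the method actually builds strata or combines per-stratum log-probabilities is not given in this excerpt.

```python
import random


def stratified_masks(seq_len, num_strata, seed=0):
    """Partition positions 0..seq_len-1 into disjoint strata (assumed scheme).

    Returns a list of boolean masks, one per stratum. masks[k][i] is True
    when position i is masked (i.e. to be predicted) in pass k, and False
    when it is left visible as context.
    """
    if not 1 <= num_strata <= seq_len:
        raise ValueError("num_strata must be between 1 and seq_len")

    rng = random.Random(seed)
    positions = list(range(seq_len))
    rng.shuffle(positions)  # random assignment is an assumption, not from the paper

    masks = [[False] * seq_len for _ in range(num_strata)]
    for rank, pos in enumerate(positions):
        # Round-robin over the shuffled order keeps strata sizes within one of each other.
        masks[rank % num_strata][pos] = True
    return masks


# With 8 positions and 4 strata, each pass masks 2 positions and leaves 6 visible.
masks = stratified_masks(seq_len=8, num_strata=4)
visible_per_pass = [mask.count(False) for mask in masks]
```

Under these assumptions every position is masked in exactly one pass, so each token is scored once while about `(num_strata - 1) / num_strata` of the sequence remains visible as context.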
