DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang

TL;DR
DiffuCoder is a 7B diffusion-based language model for code generation that enhances decoding diversity, reduces autoregressive bias, and improves performance through a novel RL training scheme called coupled-GRPO.
Contribution
This work systematically analyzes the decoding behavior of diffusion LLMs for coding and introduces coupled-GRPO, a new RL training method that improves performance and reduces AR bias.
Findings
DiffuCoder outperforms baseline models on code benchmarks (+4.4% on EvalPlus).
DiffuCoder exhibits unique decoding behaviors, such as adjustable causality and diverse generation order.
Coupled-GRPO reduces variance in training and enhances model performance.
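How "causal" a generation order is can be made concrete with a small metric. The sketch below is one plausible reading of a global, order-based score (the reviews on this page call the paper's metric AR-ness@k); it is not the paper's definition. The function name `global_arness_at_k`, the one-position-per-step simplification, and the scoring rule are all assumptions made for illustration.

```python
from typing import Sequence


def global_arness_at_k(unmask_order: Sequence[int], k: int) -> float:
    """Fraction of steps that unmask one of the k leftmost masked positions.

    `unmask_order` lists sequence positions in the order they were unmasked,
    one position per decoding step (an assumed simplification).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = sorted(unmask_order)  # positions still masked, left to right
    if len(set(remaining)) != len(remaining):
        raise ValueError("each position may be unmasked only once")
    if not remaining:
        return 0.0
    hits = 0
    for pos in unmask_order:
        # A step counts as AR-like if it fills one of the k earliest gaps.
        if pos in remaining[:k]:
            hits += 1
        remaining.remove(pos)
    return hits / len(unmask_order)


left_to_right = global_arness_at_k([0, 1, 2, 3], k=1)   # strictly causal order
right_to_left = global_arness_at_k([3, 2, 1, 0], k=1)   # reversed order
```

Under this reading, a strictly left-to-right order scores 1.0, and larger `k` tolerates more local reordering before the score drops.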
Abstract
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token…
Peer Reviews
Decision: ICLR 2026 Poster
1. Proposes local/global AR-ness@k to compare dLLMs vs. AR LLMs across stages and modalities.
2. Higher temperature makes decoding less AR-like and raises pass@k; real sample trajectories are visualized.
3. Coupled-GRPO lowers variance via complementary masks and avoids semi-AR bias, yielding stable post-training gains.
4. From adaptation → mid-training → SFT → RL, with compute setups that aid reproducibility.
5. When halving steps (2× speed), performance drops less after GRPO.
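For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator introduced with the Codex evaluation (Chen et al., 2021): draw n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k samples passes. This page does not state which estimator the authors use, so treat this as background rather than as their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


estimate = pass_at_k(n=10, c=3, k=1)
```

With k = 1 the estimate reduces to the plain success rate c / n; more diverse sampling can raise pass@k for larger k even when pass@1 stays flat.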
1. Generalization to multi-language, multi-file, or agentic tasks is unclear.
2. Multi-stage large-scale training plus RL rollouts; dLLM GRPO takes ~2× AR’s wall time.
3. Even with variance reduction, the training still approximates token log-likelihoods.
4. Stage-1 at ~700B tokens hurts downstream performance, implying sensitivity to data quality and early stopping.
5. No λ>1 (multi-pair coupling) or alternative t-distributions; limited analysis of reward weighting and verifiers.
* Proposes the first full pipeline for large-scale masked diffusion models in code generation, demonstrating competitive performance with AR models.
* Introduces coupled-GRPO, a principled, theoretically justified improvement over prior RL-based DLM training.
* Deep behavioral analysis offers valuable insight into how diffusion decoding diverges from autoregressive patterns.
* Experiments are comprehensive, including comparisons with prior DLMs, autoregressive baselines, and multiple ablations.
This work is strong across the board — in terms of methodological innovation, clarity of explanation, and thorough ablation studies. I did not find any substantive weaknesses, and I believe the paper easily meets the bar for poster acceptance at ICLR. However, given that the core method could likely generalize to domains beyond code (e.g., open-ended language modeling, multimodal reasoning), it would have been valuable to see some discussion of this potential in the main paper. Including such an
1- Provides the **first systematic measurement of AR-ness** in diffusion LLMs, with clear metrics and visualizations.
2- **Coupled-GRPO** is theoretically motivated (antithetic variates) and supported by formal variance-reduction proof.
3- Offers a **complete open-source training recipe** for a diffusion-based code LLM, potentially useful for the community.
4- Connects decoding order, sampling temperature, and parallelism, yielding practical insights for efficient generation.
5- Writing and
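The complementary-mask idea that the reviews associate with coupled-GRPO can be illustrated with a minimal sketch: sample one random mask over the completion tokens and pair it with its complement, so every token is masked in exactly one of the two passes. The function name `complementary_masks` and the fixed `mask_ratio` argument are hypothetical; how the paper samples the masking ratio and combines the two passes into a log-likelihood estimate is not described on this page.

```python
import random
from typing import List, Tuple


def complementary_masks(
    length: int, mask_ratio: float, rng: random.Random
) -> Tuple[List[bool], List[bool]]:
    """Return (mask, complement); each position is True in exactly one."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_masked = round(length * mask_ratio)
    masked = set(rng.sample(range(length), n_masked))
    mask = [i in masked for i in range(length)]
    # The paired mask covers exactly the positions the first one left visible.
    complement = [not m for m in mask]
    return mask, complement


mask, complement = complementary_masks(8, 0.25, random.Random(0))
covered_once = all(a != b for a, b in zip(mask, complement))
```

Because the two masks are perfectly negatively correlated per position, averaging estimates from the pair is an instance of the antithetic-variates pattern the review mentions, and no token is left without a training signal.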
### **Incremental Analytical Novelty**
The key observations from the AR-ness analysis (e.g., the effect of temperature, the difference between adapted and from-scratch models) confirm properties that are largely expected from the principles of masked diffusion. The contribution is in the measurement, not the discovery of new phenomena. I would suggest that the authors tone it down in the contribution sections.
### **Modest and Inconsistent Algorithmic Gains**
The improvements from Coupled-
Code & Models
- 🤗 apple/DiffuCoder-7B-Base (model) · 1.1k dl · ♡ 29
- 🤗 apple/DiffuCoder-7B-Instruct (model) · 1.6k dl · ♡ 61
- 🤗 apple/DiffuCoder-7B-cpGRPO (model) · 1.8k dl · ♡ 316
- 🤗 bachngo/DiffuCoder-7B-Q4KM (model) · 11 dl
- 🤗 Mungert/DiffuCoder-7B-cpGRPO-GGUF (model) · 102 dl · ♡ 6
- 🤗 Mungert/DiffuCoder-7B-Instruct-GGUF (model) · 214 dl · ♡ 4
Taxonomy
Topics: Model-Driven Software Engineering Techniques
