WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu; Minghua He; Shaoxun Zeng; Sijun Zhang; Linhao Zhang; Chuhan Wu; Wei Jia; Yuan Liu; Xiao Zhou; Jie Zhou

arXiv:2512.22737·cs.CL·December 30, 2025

TL;DR

WeDLM is a diffusion decoding framework built on standard causal attention. It enables faster parallel inference in large language models while maintaining quality, and it outperforms optimized autoregressive engines in speed.

Contribution

The paper presents WeDLM, a novel diffusion decoding method that uses causal attention and topological reordering to achieve efficient, parallel, and high-quality language model inference.

Findings

1. Approaches 3x speedup on reasoning benchmarks.
2. Achieves up to 10x speedup in low-entropy generation.
3. Outperforms optimized autoregressive engines in deployment settings.

Abstract

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed…
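The abstract is cut off above, so the following is only a minimal sketch of the reordering idea, not the authors' implementation. It assumes that observed tokens are placed first in the physical sequence, that masked positions follow, and that every token keeps its original logical position id (for example, for positional encoding). The names `MASK`, `topological_reorder`, and `causal_visible` are hypothetical and introduced here for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical marker for a position that has not been decoded yet.
MASK = None


def topological_reorder(
    tokens: List[Optional[str]],
) -> Tuple[List[Optional[str]], List[int]]:
    """Place observed tokens first and masked positions last.

    Returns the physically reordered tokens together with the logical
    position id of each physical slot (assumption: logical ids are kept
    so positional information is unchanged by the reordering).
    """
    observed = [i for i, t in enumerate(tokens) if t is not MASK]
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    order = observed + masked
    return [tokens[i] for i in order], order


def causal_visible(position_ids: List[int]) -> List[List[int]]:
    """Logical positions visible to each physical slot under a strict
    causal mask: slot k attends to physical slots 0..k only."""
    return [position_ids[: k + 1] for k in range(len(position_ids))]


# Logical sequence with two positions still masked.
tokens = ["The", MASK, "sat", MASK, "mat"]
physical, position_ids = topological_reorder(tokens)
visible = causal_visible(position_ids)

print(physical)      # observed tokens form the physical prefix
print(position_ids)  # logical position of each physical slot
print(visible[3])    # what the first masked slot can attend to
```

In this toy layout the observed tokens form a contiguous physical prefix, so an ordinary lower-triangular mask already lets every masked slot attend to all observed tokens, and the prefix is the part a standard KV cache could reuse.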

Code & Models

Models

tencent/WeDLM-8B-Instruct (Hugging Face model · 1.8k downloads · 311 likes)

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare