Causal Autoregressive Diffusion Language Model

Junhao Ruan; Bei Li; Yongjing Yin; Pengcheng Huang; Xin Chen; Jingang Wang; Xunliang Cai; Tong Xiao; JingBo Zhu

arXiv:2601.22031·cs.CL·January 30, 2026

TL;DR

The paper introduces Causal Autoregressive Diffusion (CARD), a new framework that combines the training efficiency of autoregressive models with the fast inference of diffusion models, enabling efficient and parallel token generation.

Contribution

CARD unifies causal autoregressive training with diffusion-style inference by reformulating the diffusion process under a strictly causal attention mask, and it stabilizes optimization with soft-tailed masking and context-aware reweighting.

Findings

01. Outperforms existing discrete diffusion baselines.

02. Reduces training latency by 3 times compared to block diffusion.

03. Achieves ARM-level data efficiency with parallel generation.

Abstract

In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves…
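The confidence-based, variable-length decoding mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the function name `accept_confident_prefix`, the `threshold` parameter, and the rule "always keep the first token, then keep going while the top probability stays above the threshold" are all assumptions made purely for illustration. The input stands in for the per-position distributions a model might predict for a run of candidate positions in one forward pass.

```python
from typing import List, Sequence


def accept_confident_prefix(
    probs: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Pick a variable-length run of tokens from per-position distributions.

    Hypothetical acceptance rule (an assumption, not taken from the paper):
    the first position is always accepted so that decoding makes progress;
    each later position is accepted only while its top probability is at
    least `threshold`. Decoding stops at the first unconfident position.
    """
    accepted: List[int] = []
    for dist in probs:
        # Greedy choice: index of the most probable token at this position.
        token = max(range(len(dist)), key=dist.__getitem__)
        if accepted and dist[token] < threshold:
            break
        accepted.append(token)
    return accepted


# Toy distributions over a 2-token vocabulary for four candidate positions.
step_probs = [
    [0.05, 0.95],  # confident
    [0.92, 0.08],  # confident
    [0.50, 0.50],  # not confident: the run stops here
    [0.01, 0.99],  # never reached in this step
]
tokens = accept_confident_prefix(step_probs, threshold=0.9)
```

In a real decoder the accepted tokens would be appended to the sequence (and to the KV cache), and the remaining positions would be predicted again in the next step, so the number of tokens emitted per step varies with the model's confidence.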

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare