Residual Context Diffusion Language Models

Yuezhou Hu; Harman Singh; Monishwaran Maheswaran; Haocheng Xi; Coleman Hooper; Jintao Zhang; Aditya Tomar; Michael W. Mahoney; Sewon Min; Mehrdad Farajtabar; Kurt Keutzer; Amir Gholami; Chenfeng Xu

arXiv:2601.22954·cs.CL·February 2, 2026

Open Access

TL;DR

Residual Context Diffusion (RCD) enhances diffusion-based language models by recycling discarded token information, significantly improving accuracy and efficiency across various benchmarks, especially on complex reasoning tasks.

Contribution

The paper introduces RCD, a module that reuses discarded token representations to improve diffusion language models at little added computational cost.

Findings

01 RCD improves accuracy by 5-10 points on various benchmarks.

02 On AIME tasks, RCD nearly doubles baseline accuracy.

03 RCD reduces denoising steps by 4-5x at similar accuracy.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and…
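
The remasking step and the idea of recycling discarded positions can be illustrated with a toy NumPy sketch. It is not the paper's implementation: the function names `remask_step` and `contextual_residual`, the probability-weighted embedding mix, and the scale `alpha` are all assumptions made for illustration, and the abstract does not specify how RCD actually builds or injects its residuals.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def remask_step(logits, masked, k):
    """Commit the k most confident masked positions; the rest stay masked.

    logits: (seq_len, vocab) scores from one denoising step
    masked: (seq_len,) bool array, True where the position is still masked
    """
    probs = softmax(logits)
    tokens = probs.argmax(axis=-1)
    # Positions that are already decoded are never selected.
    conf = np.where(masked, probs.max(axis=-1), -np.inf)
    k = min(k, int(masked.sum()))
    commit = np.argsort(-conf)[:k]
    still_masked = masked.copy()
    still_masked[commit] = False
    return tokens, commit, still_masked, probs

def contextual_residual(probs, still_masked, embedding, alpha=0.5):
    """Hypothetical residual: scaled expected embedding at positions that
    were predicted but discarded (assumption, not the paper's formula)."""
    soft = probs @ embedding              # (seq_len, dim)
    residual = np.zeros_like(soft)
    residual[still_masked] = alpha * soft[still_masked]
    return residual
```

In this sketch, a plain remasking decoder would keep only `tokens[commit]` and throw `probs` away for every other position. The residual variant instead adds `contextual_residual(...)` to the mask embeddings fed into the next denoising step, so the discarded predictions still contribute context.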

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Computational and Text Analysis Methods