Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen

TL;DR
This paper uncovers the importance of temporal dynamics in diffusion large language models and introduces methods to leverage intermediate predictions, significantly improving their accuracy across multiple benchmarks.
Contribution
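The paper presents two complementary techniques that exploit temporal consistency in diffusion LLMs: a training-free voting strategy applied at decoding time, and a post-training reinforcement method that uses Temporal Semantic Entropy (TSE) as a reward signal.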
Findings
24.7% average improvement on the Countdown dataset using the negative TSE reward alone
Absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown when the TSE reward is combined with an accuracy reward
Shows that intermediate predictions across denoising steps carry usable signal for improving diffusion LLMs
Abstract
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. We identify a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten in later steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple…
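The two ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that one answer string has already been extracted at each denoising step, that semantic equivalence can be approximated by a hypothetical `normalize` helper (trim and lowercase), that TSE is the Shannon entropy over the resulting answer clusters, and that step weights are supplied through an assumed `weight_fn` hook. All function names here are illustrative.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Optional, Sequence


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic equivalence: trim and lowercase."""
    return answer.strip().lower()


def temporal_vote(
    answers: Sequence[str],
    weight_fn: Optional[Callable[[int, int], float]] = None,
) -> str:
    """Return the answer with the largest total weight across denoising steps.

    answers[i] is the answer decoded at step i (0 is the earliest step).
    weight_fn(step, total_steps) is an assumed hook; weights are uniform
    when it is omitted.
    """
    if not answers:
        raise ValueError("need at least one intermediate answer")
    total = len(answers)
    scores = defaultdict(float)
    for step, answer in enumerate(answers):
        weight = 1.0 if weight_fn is None else weight_fn(step, total)
        scores[normalize(answer)] += weight
    # On a tie, the answer that first appeared earliest wins (dict order).
    return max(scores, key=scores.get)


def temporal_semantic_entropy(answers: Sequence[str]) -> float:
    """Entropy (in nats) of the answer clusters over denoising steps.

    Clusters here are exact matches after normalize(), an assumption made
    for this sketch. Zero means every step agreed on one answer.
    """
    if not answers:
        raise ValueError("need at least one intermediate answer")
    total = len(answers)
    counts = Counter(normalize(a) for a in answers)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy trajectory: "18" dominates the middle steps, then is overwritten.
    trajectory = ["12", "18", "18", "12", "18", "7"]
    print(temporal_vote(trajectory))
    print(temporal_vote(trajectory, weight_fn=lambda step, total: step + 1))
    print(temporal_semantic_entropy(trajectory))
```

In this toy trajectory the final step yields "7", while voting over all steps recovers "18", which mirrors the temporal oscillation the abstract describes. A reward in the spirit of Temporal Consistency Reinforcement could be taken as the negative of `temporal_semantic_entropy`, so that stable trajectories score higher.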
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper introduces a new metric, Temporal Semantic Entropy (TSE), to quantify semantic fluctuation across denoising steps, revealing meaningful patterns in dLLMs. 2. Proposes both training-free (voting-based decoding) and training-based (RL with TSE reward) methods, offering complementary ways to leverage temporal consistency. 3. Demonstrates that TSE can serve as a soft reward signal for RL even in unlabeled scenarios, expanding applicability to broader settings.
1. In Table 2, combining accuracy reward and TSE sometimes leads to negative effects under RFT, indicating potential instability in multi-reward optimization. 2. After RFT, the model’s everPass@1 performance decreases for some tasks (MATH500 and SVAMP), as shown in Table 1 and S4. 3. Using TSE as an RL reward requires semantic clustering, which introduces extra computation overhead not discussed in the paper. 4. The paper lacks clarity about which decoding strategy is used in the analysis exp
S1 This paper proposes leveraging the rich intermediate hidden states generated during the diffusion process—not only for inference-time aggregation but also for post-training reinforcement. This breaks away from the conventional paradigm that focuses solely on the final denoised output and opens a new avenue for understanding and enhancing dLLMs. S2 The authors define Temporal Semantic Entropy to quantify semantic stability during the generation process and further incorporate it as a self-supe
W1 The proposed Temporal Self-Consistency Voting essentially adapts the self-consistency idea from Self-Consistency Improves Chain-of-Thought Reasoning in Language Models (Wang et al., 2022) to diffusion models. This adaptation lacks substantial algorithmic innovation, and the observed performance gains are smaller than those achieved in autoregressive models—raising doubts about the method’s unique contribution within the diffusion framework. W2 The proposed TSE metric is based on semantic clus
* This paper is well-written and easy to follow. * This paper handles an interesting phenomenon that is distinct for diffusion language models. The authors conducted a quantitative analysis of temporal oscillations and proposed practical guidelines for leveraging them to improve the reasoning performance of diffusion language models. * The experimental section of this paper is relatively abundant, with many ablations and extra analyses.
* The mathematical notation system is not rigorous and self-consistent. In equation 1, the authors tried to demonstrate the sampling algorithm of LLaDA-like diffusion language models. Although I understand what the authors were trying to convey, Eq. 1 is incorrect and confusing from a probabilistic perspective, and I strongly advise the authors to revise its presentation. Throughout the paper, the authors made no distinction between scalar values and tensors. For example, the $x_t$ in Eq. 1 repr
Taxonomy
Topics: Natural Language Processing Techniques
