Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen

TL;DR
This paper introduces DiSE, a confidence estimation method for diffusion large language models that assesses output quality through sequence regeneration probabilities, enabling better self-evaluation and adaptive generation.
Contribution
The paper presents DiSE, a novel self-evaluation technique for dLLMs, and a flexible-length generation framework that improves quality assessment and output control.
Findings
DiSE correlates with semantic coherence and answer accuracy.
DiSE enables reliable uncertainty quantification.
The flexible-length generation adapts sequence length based on self-assessment.
Abstract
Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on…
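The abstract describes DiSE as scoring a generated sequence by the probability of regenerating its tokens given the full context. The exact formula is not given on this page, so the following is only a minimal sketch under stated assumptions: the confidence is taken to be the mean log-probability of the response tokens, computed from one forward pass over the unmasked prompt plus response. The names `regeneration_confidence`, `toy_model`, and `uniform_model` are hypothetical, and the `model` callable is a stand-in for a real dLLM, not any library's API.

```python
import math
from typing import Callable, List, Sequence

# Assumption: a "model" maps a list of token ids to one row of vocabulary
# logits per position. This is a stand-in interface, not a real dLLM API.
Model = Callable[[List[int]], List[List[float]]]


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def regeneration_confidence(
    model: Model, prompt_ids: Sequence[int], response_ids: Sequence[int]
) -> float:
    """Mean log-probability assigned to the response tokens when the model
    sees the full, unmasked prompt + response (a single forward pass).

    Assumption: averaging log-probabilities is one plausible aggregation;
    the paper may aggregate differently.
    """
    if not response_ids:
        raise ValueError("response_ids must not be empty")
    full = list(prompt_ids) + list(response_ids)
    logits = model(full)  # the only forward pass
    start = len(prompt_ids)
    total = 0.0
    for offset, token in enumerate(response_ids):
        total += log_softmax(logits[start + offset])[token]
    return total / len(response_ids)


VOCAB_SIZE = 4


def toy_model(ids: List[int]) -> List[List[float]]:
    """Hypothetical model that favours re-emitting each visible token."""
    return [[2.0 if v == tok else 0.0 for v in range(VOCAB_SIZE)] for tok in ids]


def uniform_model(ids: List[int]) -> List[List[float]]:
    """Hypothetical model with no preference among tokens."""
    return [[0.0] * VOCAB_SIZE for _ in ids]


confident = regeneration_confidence(toy_model, [0, 1], [2, 3])
unsure = regeneration_confidence(uniform_model, [0, 1], [2, 3])
print(confident > unsure)
```

Under this sketch a higher score means the model more readily reproduces its own output, which is the signal the abstract says is used for self-evaluation. A Monte Carlo likelihood estimate would instead call the model once per sampled masking pattern.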
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper proposes an effective self-evaluation method, DiSE, for dLLMs that leverages regeneration probability. Compared to the iterative Monte Carlo baseline, which requires numerous forward passes, DiSE needs only a single forward pass. 2. The paper introduces a flexible-length generation framework built on DiSE, which directly addresses the fixed-length generation constraint that typically limits dLLMs. 3. Experimental results demonstrate the effectiveness of the proposed method. DiSE pro
1. The paper lacks experiments on a broader set of open-source dLLMs (e.g., Dream [1]) to sufficiently demonstrate the effectiveness and generalizability of the proposed DiSE. 2. The experimental details regarding conditional likelihood estimation need to be further clarified. For example, it is unclear whether the response used in the likelihood estimation is the model-generated output or the ground-truth answer. 3. The experiments on flexible-length generation lack a comparison with other metho
1. DiSE is faster than MC estimation of the likelihood and can compare candidate answers on MCQ benchmarks in a single forward pass, whereas MC estimation requires many iterations to approximate the true likelihood and pick the most likely answer. 2. DiSE leads to higher accuracy on GPQA and Math benchmarks, and improves the ROC AUC, compared to MC integration *with few samples (1 or 32)*.
### Summary of the weaknesses 1. **Likelihood**: DiSE is not shown to estimate or bound the true data likelihood; the reported "gains" in likelihood are not clear vs MC bounds and AR perplexity. 2. **Factual inaccuracy on generation length**: claims that dLLMs require fixed lengths ignore semi‑autoregressive/variable‑length approaches explored in LLaDA, Plaid, MDLM. 3. **Insufficient citation/positioning**: closely related masked‑LM pseudo‑likelihood work [4-6] is not cited. The existence of the
1. The method is simple and easy to use. 2. The writing is clear and easy to follow.
1. The authors should evaluate DiSE on dLLMs trained with other methods, such as Dream. 2. The authors should explain why DiSE works, rather than simply reporting the phenomenon. In LLaDA training, only the predictions at masked tokens are supervised, so the logits produced at known (unmasked) tokens are, intuitively speaking, not meaningful. Analyzing this phenomenon would make the paper more convincing. 3. The author should show th
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
