Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

Jinsong Li; Xiaoyi Dong; Yuhang Zang; Yuhang Cao; Jiaqi Wang; Dahua Lin

arXiv:2508.00819·cs.CL·August 19, 2025

TL;DR

DAEDAL introduces a training-free, dynamic length expansion method for diffusion large language models, improving efficiency and performance by adaptively adjusting response length during generation.

Contribution

The paper proposes DAEDAL, a novel training-free approach that enables dynamic, adaptive response length expansion in DLLMs, overcoming the static length limitation without additional training.

Findings

01. DAEDAL achieves comparable or better performance than fixed-length baselines.

02. It improves computational efficiency by increasing the effective token ratio.

03. The method effectively adapts response length to task complexity.
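
To make the "effective token ratio" in the second finding concrete, here is a minimal sketch. It assumes the ratio is the fraction of generated positions that hold something other than the end-of-sequence (padding) token; the function name, the token ids, and the choice of `0` as the EOS id are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def effective_token_ratio(token_ids: Sequence[int], eos_id: int) -> float:
    """Fraction of generated positions that are not EOS/padding.

    Assumed definition for illustration: effective tokens / total tokens.
    """
    if not token_ids:
        return 0.0
    effective = sum(1 for t in token_ids if t != eos_id)
    return effective / len(token_ids)


# Hypothetical outputs: the same 3-token answer under two length budgets.
fixed_budget = [11, 12, 13, 0, 0, 0, 0, 0]   # 8 slots, 5 wasted on EOS
adaptive_budget = [11, 12, 13, 0]            # 4 slots, 1 EOS

print(effective_token_ratio(fixed_budget, eos_id=0))     # 0.375
print(effective_token_ratio(adaptive_budget, eos_id=0))  # 0.75
```

Under this assumed definition, a shorter canvas that still fits the answer yields a higher ratio, because fewer denoised positions are spent on padding.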

Abstract

Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The method seems effective and is very simple. The presentation is reader-friendly, but I believe it would benefit from more detail; for example, Algorithm 1 contains a few more details than Fig. 3. The results are reasonable and ease the need for manual sequence-length tuning.

Weaknesses

The main weakness is that the main baselines to compare against are block diffusion or other adaptive-length methods. The selling point of variable-length diffusion would likely have to be a speed increase while preserving accuracy, or another point on the speed-accuracy Pareto frontier. It is possible that autoregressive models with speculative decoding achieve better speeds at similar accuracy.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The goal of achieving adaptive-length decoding is promising and important for obtaining a better trade-off frontier between sample quality and latency.
2. The proposed approach is easy to follow, and the method requires no additional training.
3. The presentation of the paper is clear, and the empirical performance, specifically on accuracy vs. total tokens, is quite impressive. The ablation studies, including the hyperparameter sensitivity analysis, are informative.

Weaknesses

1. The proposed idea of using the confidence of predicting the EOS token to determine the length of the sequence seems somewhat heuristic. A more comprehensive evaluation on more datasets and tasks should be conducted to justify the effectiveness of this heuristic.
2. Efficiency metrics such as wall-clock time or latency are missing.
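
The EOS-confidence heuristic mentioned in the first weakness can be pictured roughly as follows. This is a sketch under stated assumptions, not the paper's algorithm: `eos_confidence` is a hypothetical stand-in for a model call that returns how confidently the model predicts EOS near the end of an all-mask canvas of a given length, and `start_len`, `expand_by`, `threshold`, and `max_len` are made-up parameters.

```python
from typing import Callable


def choose_initial_length(
    eos_confidence: Callable[[int], float],  # hypothetical model probe
    start_len: int = 32,
    expand_by: int = 8,
    threshold: float = 0.5,
    max_len: int = 256,
) -> int:
    """Grow the canvas until tail EOS confidence reaches `threshold`.

    Low EOS confidence at the tail is read as "the model wants more room",
    so the canvas is extended by `expand_by` positions, capped at `max_len`.
    """
    length = start_len
    while length < max_len and eos_confidence(length) < threshold:
        length = min(length + expand_by, max_len)
    return length


# Toy stand-in for a model: confidence rises linearly with canvas length.
def toy_confidence(n: int) -> float:
    return min(1.0, n / 100)


print(choose_initial_length(toy_confidence))  # 56
```

With the toy probe, lengths 32, 40, and 48 fall below the 0.5 threshold and 56 is the first to clear it. A real probe would require a forward pass per candidate length, which is relevant to the reviewer's second point about missing latency measurements.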

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The proposed method is simple and intuitively reasonable. Requiring no retraining is a plus.
- The proposed method demonstrates solid empirical improvements over the best-tuned fixed-length baselines on math reasoning and code generation benchmarks (e.g., MATH500), while generating effective tokens more efficiently.
- This paper conducts a thorough analysis of key hyperparameters of the method, showing its robustness to different configurations.

Weaknesses

- While the experiments show that combining the two expansion stages gives the best performance, a clear principled reason why both stages are necessary is still lacking. It is natural to consider merging the first length adjustment stage into the second dynamic expansion stage. Interestingly, according to Table 2, stage 1 already contributes most of the performance improvement. I think this deserves more investigation.
- The ablation study shows the method's rob

Taxonomy

Topics: Speech Recognition and Synthesis