EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients

He-Yen Hsieh; Hong Wang; H. T. Kung

arXiv:2512.00670·cs.AI·December 2, 2025

Open Access 3 Reviews

TL;DR

This paper introduces EDIT, a method that adaptively terminates diffusion-based large language model inference by monitoring training-gradient-derived reasoning stability, significantly reducing inference steps while maintaining accuracy.

Contribution

We propose a novel inference-time criterion using training-gradient dynamics to efficiently stop diffusion processes in dLLMs, reducing computation without sacrificing performance.

Findings

1. Reduces diffusion steps by up to 68.3%
2. Maintains or improves accuracy across reasoning benchmarks
3. Adds minimal storage overhead (~0.02%)

Abstract

Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT (Early Diffusion Inference Termination), an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected. EDIT monitors the alignment between token activations and a reasoning map derived from AdamW-aggregated LoRA updates captured during supervised fine-tuning (SFT). During training, optimization dynamics generate rich metadata about parameter importance that in prior methods is typically discarded upon model release. We preserve this information as a compact representation of learned reasoning pathways. During inference, alignment scores are converted to a distribution over the tokens already unmasked at the current denoising step, and…
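The abstract is cut off before it states the stopping rule, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes alignment is a dot product between each token's activation vector and a reasoning-map vector, that the scores are turned into a distribution with a softmax over unmasked positions, and that denoising stops once the KL divergence between consecutive distributions (restricted to their shared positions) stays below a threshold `delta` for `omega` consecutive steps. All function names, the dot-product score, and this reading of `delta` and `omega` are hypothetical.

```python
import math


def alignment_distribution(activations, reasoning_map, unmasked):
    """Softmax over alignment scores, restricted to unmasked positions.

    activations: dict mapping position -> activation vector (list of floats).
    reasoning_map: reference vector (assumed form; hypothetical).
    unmasked: non-empty iterable of positions already unmasked at this step.
    """
    scores = {
        i: sum(a * r for a, r in zip(activations[i], reasoning_map))
        for i in unmasked
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def matched_support_kl(p, q):
    """KL(p || q) over positions present in both, after renormalising each."""
    shared = sorted(set(p) & set(q))
    if not shared:
        return math.inf
    zp = sum(p[i] for i in shared)
    zq = sum(q[i] for i in shared)
    if zp == 0.0 or zq == 0.0:
        return math.inf
    kl = 0.0
    for i in shared:
        pi, qi = p[i] / zp, q[i] / zq
        if pi == 0.0:
            continue  # 0 * log(0 / q) contributes nothing
        if qi == 0.0:
            return math.inf
        kl += pi * math.log(pi / qi)
    return kl


def early_stop_step(distributions, delta, omega):
    """Return the first step index t at which the KL between consecutive
    distributions has been below delta for omega steps in a row, else None."""
    streak = 0
    for t in range(1, len(distributions)):
        kl = matched_support_kl(distributions[t], distributions[t - 1])
        streak = streak + 1 if kl < delta else 0
        if streak >= omega:
            return t
    return None
```

In this sketch the per-step distributions would be collected during denoising and `early_stop_step` consulted after each step; a return value of `None` means all scheduled steps are run.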

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The method is clear, novel, and well supported by theory.
2. The method leaves the model's reasoning structure unchanged: its additions happen during training and the model is not modified during inference, which differentiates it from previous efficiency-enhancing methods that require modifying the decoding or architecture.
3. The paper is easy to read, and its motivations are well connected.

Weaknesses

1. The reliance on training metadata requires access to the optimization trajectory during the training phase, which rules out closed-source models and models released only as final checkpoint files.
2. Validation focuses on LoRA-SFT; it remains unclear how EDIT performs with full-parameter fine-tuning, other adapters, or different optimizers.
3. Small accuracy dips appear in longer sequences, and the paper does not deeply analyze when EDIT stabilizes too early.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* Clear definition of the AdamW evolution signal and why LoRA-B is preferred, with sparsity metrics and visualizations.
* Matched-support KL and multi-step TV bounds yield simple certificates and a PAC-style calibration rule.
* Minimal storage and implementation overhead, with complexity lower than attention and integration as a wrapper at inference.
* Certified early stop rates reported across tasks, indicating practical realizations of the theory.

Weaknesses

* Hyperparameter selection uses per-task validation sets and a grid over $(\delta,\Omega)$; robustness to mis-tuning or cross-task portability is not thoroughly analyzed.
* GSM8K at length 512 shows a noticeable accuracy drop with EDIT, which deserves a short diagnostic beyond the brief discussion. Minor.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* **Insightful diagnostics.** Gradient-based “pseudo-gradient” alignment with SFT gradients; domain-wise breakdowns (e.g., GPQA subdomains) and LoRA-A vs. LoRA-B sparsity analyses justify design choices.
* **Practical gains with tiny overhead.** Reported step reductions up to ~68% and a ~1.5 MB metadata footprint for a 32-block model are compelling for deployment.

Weaknesses

* **Metadata extraction.** EDIT relies on training-time metadata extraction. Many released checkpoints do not expose their training recipe; does the choice of training recipe (dataset, hyperparameters) affect the extraction?
* **Scope of validation.** Experiments center on a single dLLM family (LLaDA-8B) and five reasoning tasks on one hardware stack. It is hard to assess robustness across models, sizes, and datasets.
* **Task-tuned thresholds.** ($\delta,\Omega$) are selected per task via validation for an a

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning