Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

Michael Rottoli; Subhankar Roy; Stefano Paraboschi

arXiv:2605.04215 · cs.LG · May 15, 2026

TL;DR

This paper introduces Predict-then-Diffuse, a framework that estimates response length to optimize compute resources in diffusion-based large language models, reducing costs while maintaining quality.

Contribution

It proposes an adaptive response-length predictor and a safety mechanism that together enable compute-budgeted inference in diffusion LLMs, improving efficiency without sacrificing output quality.

Findings

01. Significantly reduces FLOPs compared to default inference.

02. Robust to skewed data distributions.

03. Maintains output quality while optimizing compute resources.

Abstract

Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: oversized response length results in computational waste on semantically meaningless padding tokens, while undersized response length causes output truncation requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with…
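
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the excerpt above does not specify how the length predictor or the safety mechanism works, so everything here is an assumption for illustration. The names `budgeted_length`, `wasted_fraction`, and `generate_with_retry`, the 25% margin, the 32-token block size, the 1024-token cap, and the doubling retry are all hypothetical. The `generate` callable stands in for a diffusion LLM call that returns the tokens and a flag saying whether the output was truncated.

```python
import math


def budgeted_length(predicted_len, margin=0.25, block=32, max_len=1024):
    """Turn a predicted response length into an allocation (hypothetical rule).

    Adds a relative safety margin, rounds up to a multiple of `block`,
    and caps the result at `max_len`.
    """
    padded = math.ceil(predicted_len * (1.0 + margin))
    rounded = math.ceil(padded / block) * block
    return max(block, min(rounded, max_len))


def wasted_fraction(allocated_len, true_len):
    """Fraction of allocated positions that end up as padding."""
    return max(0, allocated_len - true_len) / allocated_len


def generate_with_retry(generate, predicted_len, max_len=1024):
    """Run generation at the budgeted length, growing it on truncation.

    `generate(length)` is an assumed interface returning (tokens, truncated).
    Doubling on truncation is an illustrative fallback, not the paper's
    safety mechanism.
    """
    length = budgeted_length(predicted_len, max_len=max_len)
    while True:
        tokens, truncated = generate(length)
        if not truncated or length >= max_len:
            return tokens, length
        length = min(2 * length, max_len)
```

Under these assumptions, a response that truly needs 100 tokens wastes 87.5% of a fixed 1024-token allocation but about 22% of the 128 tokens allocated from an accurate prediction. An underestimate, by contrast, triggers the re-computation path that the abstract identifies as a source of latency spikes.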
