Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Jucheng Shen; Gaurav Sarkar; Yeonju Ro; Sharath Nittur Sridhar; Zhangyang Wang; Aditya Akella; Souvik Kundu

arXiv:2512.07173 · cs.LG · April 20, 2026

TL;DR

CadLLM is a training-free, confidence-aware method that dynamically adjusts generation parameters to significantly improve inference throughput of diffusion-based large language models without sacrificing accuracy.

Contribution

CadLLM introduces a lightweight adaptive approach for diffusion-based LLMs that adjusts generation parameters based on token unmasking confidence, improving throughput without retraining.

Findings

01

Achieves up to 2.28x throughput improvement over the state-of-the-art baseline.

02

Demonstrates effectiveness across four popular tasks.

03

Maintains competitive accuracy with increased efficiency.

Abstract

We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically restricting the softmax to a subset of the vocabulary, which regulates sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields a 1.1x to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
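
To make the two mechanisms in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no update rules, so the names `GenParams`, `adapt_params`, and `topk_softmax`, the cutoffs `lo` and `hi`, the doubling and halving policy, and the use of a top-k set as the vocabulary subset are all assumptions chosen for illustration. The sketch only shows the general shape of a controller that reads the average confidence of unmasked tokens and a softmax restricted to part of the vocabulary.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class GenParams:
    block_size: int   # tokens generated per block
    steps: int        # denoising steps spent on a block
    threshold: float  # minimum confidence required to unmask a token
    vocab_k: int      # number of vocabulary entries kept for the softmax


def adapt_params(mean_conf, base, lo=0.5, hi=0.9):
    """Hypothetical policy (not from the paper): decode more aggressively
    when the average unmasking confidence is high, more carefully when low."""
    if mean_conf >= hi:
        return GenParams(
            block_size=base.block_size * 2,
            steps=max(1, base.steps // 2),
            threshold=max(0.5, base.threshold - 0.05),
            vocab_k=max(1, base.vocab_k // 2),
        )
    if mean_conf < lo:
        return GenParams(
            block_size=max(1, base.block_size // 2),
            steps=base.steps * 2,
            threshold=min(0.99, base.threshold + 0.05),
            vocab_k=base.vocab_k * 2,
        )
    return base  # mid-range confidence: keep the current settings


def topk_softmax(logits, k):
    """Softmax over only the k largest logits; returns {token_id: prob}.
    Assumes the 'subset of the vocabulary' is a top-k set."""
    k = max(1, min(k, len(logits)))
    top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    peak = logits[top[0]]  # subtract the max logit for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}
```

In a decoding loop, one would presumably call `adapt_params` once per block or step with the mean confidence of the tokens just unmasked, then pass the resulting `vocab_k` to `topk_softmax` when scoring the remaining masked positions. A real system would run the top-k selection on the accelerator rather than in pure Python.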
