DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan, Yanyong Zhang

TL;DR
DepCap is an adaptive, training-free framework for block-wise diffusion language model (DLM) inference. It improves decoding speed without significant quality loss by using influence signals to set block boundaries dynamically and token-level conflict detection to decide which tokens can be decoded in parallel.
Contribution
DepCap proposes an adaptive, training-free method for block-wise DLM inference that improves the speed-quality trade-off through influence-based block extension and conflict-aware parallel decoding.
Findings
Achieves up to 5.63× speedup without significant quality loss.
Applicable to various DLM architectures and compatible with existing KV-cache strategies.
Demonstrates improved inference efficiency across multiple benchmarks.
Abstract
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based…
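To make the baseline described in the abstract concrete, the following is a minimal sketch of block-wise decoding with a fixed block schedule and conservative confidence-thresholded parallel decoding. It is not DepCap's method. The predictor `toy_predict`, the function `decode_blockwise`, and the `threshold` parameter are all hypothetical names introduced for illustration, and the toy confidence rule (a token is confident only once its left neighbor is decoded) is an assumption standing in for a real DLM forward pass.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a not-yet-decoded position

# Hypothetical predictor type: for every position, a (token, confidence) pair.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def toy_predict(seq: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Toy stand-in for a DLM forward pass (an assumption, not a real model).

    The predicted token at position i is simply i. Confidence is high only
    when the left neighbor is already decoded, which mimics dependencies
    between neighboring tokens.
    """
    preds = []
    for i in range(len(seq)):
        left_known = i == 0 or seq[i - 1] is not MASK
        preds.append((i, 0.95 if left_known else 0.5))
    return preds


def decode_blockwise(
    predict: Predictor, length: int, block_size: int, threshold: float
) -> Tuple[List[Optional[int]], int]:
    """Decode fixed-size blocks left to right.

    Within a block, every masked token whose confidence reaches `threshold`
    is committed in the same step. Returns the sequence and the step count.
    """
    seq: List[Optional[int]] = [MASK] * length
    steps = 0
    for start in range(0, length, block_size):  # fixed block schedule
        end = min(start + block_size, length)
        while any(seq[i] is MASK for i in range(start, end)):
            preds = predict(seq)
            steps += 1
            masked = [i for i in range(start, end) if seq[i] is MASK]
            accepted = [i for i in masked if preds[i][1] >= threshold]
            if not accepted:
                # Commit the single most confident token so each step progresses.
                accepted = [max(masked, key=lambda i: preds[i][1])]
            for i in accepted:
                seq[i] = preds[i][0]
    return seq, steps


if __name__ == "__main__":
    for threshold in (0.9, 0.4):
        tokens, steps = decode_blockwise(toy_predict, 8, 4, threshold)
        print(threshold, steps, tokens)
```

In this toy setting, a strict threshold commits one token per step, while a loose threshold commits a whole block at once. A real model would pay for the loose setting with conflicting tokens, which is the trade-off the abstract says fixed schedules and confidence-only decoding handle poorly.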
