D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

TL;DR
D-PACE introduces a dynamic, position-aware training method for parallel speculative decoding in large language models, improving speed and draft length without altering inference procedures.
Contribution
It derives a loss function whose per-position training weights come from a differentiable surrogate of expected accepted draft length, so the training signal shifts toward the positions that currently limit acceptance.
Findings
Consistently improves speedup and emitted draft length across multiple benchmarks.
Adds only 2.3% training-time overhead and leaves the inference architecture unchanged.
Effective across various models, depths, and decoding temperatures.
Abstract
Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the…
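The abstract does not state the exact form of the surrogate, so the following is only an illustrative sketch under an assumed one, not the paper's implementation. It assumes a block is accepted position by position until the first rejection, and uses the drafter's probability on the target token at position j, written p_j, as a proxy for acceptance at that position. Under that assumption the expected accepted length is the sum of prefix products of p_j, and its gradient with respect to log p_i is the sum of all prefix products that include position i. The function names `expected_accepted_length`, `position_weights`, and `weighted_cross_entropy`, and the normalization to mean 1, are hypothetical choices made for this example.

```python
import numpy as np

def expected_accepted_length(p):
    """Assumed surrogate: sum over k of p_1 * ... * p_k (accept until first rejection)."""
    p = np.asarray(p, dtype=float)
    return float(np.cumprod(p).sum())

def position_weights(p):
    """Gradient of the assumed surrogate with respect to log p_i, for each position i.

    Position i appears in every prefix product of length >= i, so its
    weight is the reversed cumulative sum of the prefix products.
    """
    p = np.asarray(p, dtype=float)
    prefix = np.cumprod(p)                  # prefix[k] = p_1 * ... * p_(k+1)
    return np.cumsum(prefix[::-1])[::-1]    # weights[i] = sum of prefix[k] for k >= i

def weighted_cross_entropy(p):
    """Cross-entropy with per-position weights treated as constants.

    Weights are rescaled to average 1 (an assumption of this sketch) so the
    loss scale stays comparable to plain cross-entropy. In an autograd
    framework the weights would be detached from the graph.
    """
    p = np.asarray(p, dtype=float)
    w = position_weights(p)
    w = w / w.mean()
    return float(-(w * np.log(p)).mean())

# Example: the second position is the weakest link in a 3-token block.
probs = [0.9, 0.5, 0.8]
weights = position_weights(probs)
```

With these example probabilities, early positions receive larger weights because every later position's acceptance depends on them, and the weights change as the probabilities change during training, which is the dynamic behavior the abstract describes in contrast to fixed schedules.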
