Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput
Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, and Lynn Ai

TL;DR
This paper introduces Decentralized Speculative Decoding (DSD), a framework that leverages speculative decoding in distributed LLM inference to turn communication latency into increased throughput, achieving significant speedups without retraining.
Contribution
The paper proposes DSD, a novel decentralized speculative decoding framework with adaptive verification, reducing communication costs and improving inference speed in distributed LLM systems.
Findings
Achieves up to 2.56x speedup on HumanEval
Achieves up to 2.59x speedup on GSM8K
Reduces cross-node communication cost, in theory by approximately $(N-1)\,t_1\,(k-1)/k$, where $t_1$ is per-link latency
Abstract
Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in decentralized settings, where network latency often dominates compute, remains under-characterized. We present Decentralized Speculative Decoding (DSD), a plug-and-play framework for decentralized inference that turns communication delay into useful computation by verifying multiple candidate tokens in parallel across distributed nodes. We further introduce an adaptive speculative verification strategy that adjusts acceptance thresholds by token-level semantic importance, delivering an additional 15% to 20% end-to-end speedup without retraining. In theory, DSD reduces cross-node communication cost by approximately $(N-1)\,t_1\,(k-1)/k$, where $t_1$ is per-link latency…
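The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's DSD implementation: it uses plain greedy verification (a drafted token is accepted only if it equals the target's own greedy choice) rather than the adaptive, importance-weighted thresholds the authors describe. The names `toy_target`, `toy_draft`, `speculative_decode`, and `greedy_decode` are hypothetical stand-ins introduced for this example. The point is only that the target model is consulted once per round of k drafted tokens, so the number of verification rounds, and in a decentralized deployment roughly the number of cross-node round trips, falls below the number of generated tokens.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "strong" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "cheap" model: agrees with the target except
    # when the context length is a multiple of 7.
    if len(ctx) % 7 == 0:
        return (ctx[-1] + 2) % 10
    return (ctx[-1] + 1) % 10


def greedy_decode(target: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    # Baseline: one target call (one round) per generated token.
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    num_new: int,
    k: int,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (new tokens, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    rounds = 0
    while len(seq) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) One verification round. A real system would score all k
        #    positions in a single batched target pass; the loop below
        #    emulates that pass position by position.
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the same pass yields one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):goal], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft, [0], num_new=20, k=4)
    print("tokens:", tokens)
    print("verification rounds:", rounds, "vs. 20 for token-by-token decoding")
```

Because every accepted token equals the target's greedy choice, the output is identical to token-by-token decoding with the target alone; only the number of verification rounds changes.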
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Topic Modeling
