Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Jingwei Song; Wanyi Chen; Xinyuan Song; Max; Chris Tong; Gufeng Chen; Tianyi Zhao; Eric Yang; Bill Shi; and Lynn Ai

arXiv:2511.11733 · cs.DC · November 18, 2025

TL;DR

This paper introduces Decentralized Speculative Decoding (DSD), a framework that applies speculative decoding to distributed LLM inference so that communication latency is spent verifying multiple candidate tokens in parallel. It reports speedups of up to 2.56x on HumanEval and 2.59x on GSM8K without retraining.

Contribution

The paper proposes DSD, a decentralized speculative decoding framework with an adaptive verification strategy that adjusts acceptance thresholds by token-level semantic importance, reducing cross-node communication cost and improving inference speed in distributed LLM systems.

Findings

01. Achieves up to 2.56x speedup on HumanEval

02. Achieves up to 2.59x speedup on GSM8K

03. Reduces cross-node communication cost, in theory by approximately (N-1)t1(k-1)/k

Abstract

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in decentralized settings, where network latency often dominates compute, remains under-characterized. We present Decentralized Speculative Decoding (DSD), a plug-and-play framework for decentralized inference that turns communication delay into useful computation by verifying multiple candidate tokens in parallel across distributed nodes. We further introduce an adaptive speculative verification strategy that adjusts acceptance thresholds by token-level semantic importance, delivering an additional 15% to 20% end-to-end speedup without retraining. In theory, DSD reduces cross-node communication cost by approximately (N-1)t1(k-1)/k, where t1 is per-link latency…
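The communication-cost expression in the abstract can be checked with a small worked example. The abstract is truncated before it defines every symbol, so the sketch below rests on assumptions: N is read as the number of nodes in a pipeline (hence N-1 links), t1 as the per-link latency in seconds, and k as the number of tokens verified per communication round. The function names and the sample values are hypothetical and do not come from the paper.

```python
def comm_cost_per_token(num_nodes: int, link_latency: float,
                        tokens_per_round: int = 1) -> float:
    """Per-token communication cost for a pipeline of `num_nodes` nodes.

    Assumption: one round crosses (num_nodes - 1) links, and the round's
    cost is shared by `tokens_per_round` tokens verified together.
    """
    if num_nodes < 1 or tokens_per_round < 1:
        raise ValueError("num_nodes and tokens_per_round must be >= 1")
    return (num_nodes - 1) * link_latency / tokens_per_round


def comm_saving_per_token(num_nodes: int, link_latency: float,
                          tokens_per_round: int) -> float:
    """Saving from the abstract's expression (N-1) * t1 * (k-1) / k."""
    if num_nodes < 1 or tokens_per_round < 1:
        raise ValueError("num_nodes and tokens_per_round must be >= 1")
    return ((num_nodes - 1) * link_latency
            * (tokens_per_round - 1) / tokens_per_round)


if __name__ == "__main__":
    # Hypothetical values: 4 nodes, 50 ms per link, 4 tokens per round.
    n, t1, k = 4, 0.05, 4
    baseline = comm_cost_per_token(n, t1)        # one token per round
    speculative = comm_cost_per_token(n, t1, k)  # k tokens per round
    print(f"baseline per-token cost:    {baseline:.4f} s")
    print(f"speculative per-token cost: {speculative:.4f} s")
    print(f"saving per token:           {comm_saving_per_token(n, t1, k):.4f} s")
```

Under these assumptions the saving is the difference between the baseline cost (N-1)t1 and the amortized cost (N-1)t1/k. It is zero at k = 1 and approaches the full baseline cost as k grows. The sketch treats every token in a round as accepted, which a real system would not guarantee.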

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Topic Modeling