DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

Yaodan Xu; Sheng Zhou; Zhisheng Niu

arXiv:2604.20919 · cs.IT · April 24, 2026

TL;DR

DiP-SD is a distributed, pipelined speculative decoding method for multi-user large language model (LLM) inference at the edge. It raises throughput by combining device-level distributed drafting with draft-verify pipelining, and by jointly optimizing batching and draft lengths.

Contribution

The paper proposes a distributed pipelined speculative decoding framework that improves edge LLM inference throughput by jointly optimizing batching and per-user draft lengths.

Findings

1. Achieves up to 17.89x throughput over autoregressive decoding.

2. Attains 1.93x throughput improvement over greedy batching.

3. Effectively balances batching and draft-length decisions for high throughput.

Abstract

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft tokens are generated locally on devices and subsequently offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under coupled decisions of (i) batching and pipeline scheduling and (ii) per-user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, user-to-batch assignment, and integer draft lengths. To solve the resulting…
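To make the throughput objective in the abstract concrete, here is a minimal Python sketch of "expected accepted tokens per unit time" for a single batch of users. It is not the paper's formulation. It assumes the commonly used model in which each draft token is accepted independently with a per-user probability alpha, counts the one token the verifier always emits per round, and uses a hypothetical linear cost model for server-side verification. All function and parameter names (expected_tokens_per_round, batch_throughput, best_draft_lengths, verify_base, verify_per_token) are illustrative, and pipelining across batches is deliberately left out.

```python
from itertools import product
from typing import Sequence, Tuple


def expected_tokens_per_round(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted in one draft-then-verify round.

    Assumption: each draft token is accepted independently with
    probability alpha; the count includes the one token the verifier
    always contributes, giving (1 - alpha^(L+1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0 or draft_len < 0:
        raise ValueError("need 0 <= alpha <= 1 and draft_len >= 0")
    if alpha == 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def batch_throughput(
    alphas: Sequence[float],
    draft_lens: Sequence[int],
    draft_time_per_token: Sequence[float],
    verify_base: float,
    verify_per_token: float,
) -> float:
    """Expected tokens per second for one batch, without pipelining.

    Devices draft in parallel, so the drafting phase lasts as long as
    the slowest device. The verification cost model (base plus a
    per-draft-token term) is a hypothetical placeholder.
    """
    tokens = sum(
        expected_tokens_per_round(a, n) for a, n in zip(alphas, draft_lens)
    )
    draft_time = max(n * t for n, t in zip(draft_lens, draft_time_per_token))
    verify_time = verify_base + verify_per_token * sum(draft_lens)
    return tokens / (draft_time + verify_time)


def best_draft_lengths(
    alphas: Sequence[float],
    draft_time_per_token: Sequence[float],
    verify_base: float,
    verify_per_token: float,
    max_len: int = 8,
) -> Tuple[Tuple[int, ...], float]:
    """Brute-force search over integer draft lengths in 1..max_len.

    Exhaustive search is only practical for a handful of users; it is
    used here for illustration, not as the paper's solution method.
    """
    def score(lens: Tuple[int, ...]) -> float:
        return batch_throughput(
            alphas, lens, draft_time_per_token, verify_base, verify_per_token
        )

    best = max(product(range(1, max_len + 1), repeat=len(alphas)), key=score)
    return best, score(best)
```

Calling best_draft_lengths([0.8, 0.6], [0.01, 0.02], 0.05, 0.002) returns the draft-length tuple that maximizes this toy objective together with its throughput. Even in this simplified setting the coupling is visible: a slow device with a long draft stalls the whole batch, which is why draft lengths and batch membership have to be chosen together.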
