WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Xiangchen Li; Jiakun Fan; Qingyuan Wang; Dimitrios Spatharakis; Saeid Ghafouri; Hans Vandierendonck; Deepu John; Bo Ji; Ali R. Butt; Dimitrios S. Nikolopoulos

arXiv:2601.11652 · cs.DC · April 8, 2026

TL;DR

WISP is a distributed speculative LLM serving system that suppresses wasted drafting time and verification interference, balancing workload between edge devices and data centers through dynamic drafting and SLO-aware batching.

Contribution

The paper identifies and formalizes two bottlenecks in distributed speculative LLM serving, Wasted Drafting Time and Verification Interference, and introduces WISP, a system whose components target efficiency and scalability at the edge-cloud interface.

Findings

01

WISP improves system capacity by up to 2.1x and 4.1x.

02

WISP increases system goodput by up to 1.94x and 3.7x.

03

WISP balances workload between edge devices and the data center and reduces resource waste.

Abstract

As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computation workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud, while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware…
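
The draft-then-verify idea behind speculative decoding, which the abstract relies on, can be illustrated with a small sketch. This is not WISP's algorithm: the toy models `target_next` and `draft_next`, the function names, and the `wasted` counter are all hypothetical and chosen only for illustration. The sketch uses greedy verification, so the output matches what the target model would produce on its own, and it counts drafted tokens that verification discards as a rough, informal stand-in for drafting effort that goes to waste (the paper's formal definition of Wasted Drafting Time may differ).

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    # Hypothetical "large" model: a deterministic toy rule over integer tokens.
    return (sum(prefix) * 31 + len(prefix)) % 10


def draft_next(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when the
    # prefix length is a multiple of 4, where it guesses a different token.
    guess = target_next(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def greedy_decode(prompt: List[int], num_tokens: int, target: NextToken) -> List[int]:
    # Baseline: the target model generates every token itself.
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[int],
    num_tokens: int,
    draft_len: int,
    draft: NextToken,
    target: NextToken,
) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + num_tokens
    wasted = 0  # drafted tokens that verification discarded
    while len(out) < goal:
        # Draft phase (would run on the edge device): propose draft_len tokens.
        drafted: List[int] = []
        for _ in range(draft_len):
            drafted.append(draft(out + drafted))
        # Verify phase (would run on the server): accept the longest prefix
        # of the draft that matches the target's own greedy choices.
        accepted = 0
        for i, tok in enumerate(drafted):
            if tok != target(out + drafted[:i]):
                break
            accepted += 1
        out.extend(drafted[:accepted])
        wasted += len(drafted) - accepted
        # The target always contributes one token: the correction after a
        # mismatch, or a bonus token when the whole draft was accepted.
        out.append(target(out))
    return out[:goal], wasted


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, wasted = speculative_decode(prompt, 12, 4, draft_next, target_next)
    print("lossless:", tokens == greedy_decode(prompt, 12, target_next))
    print("wasted drafted tokens:", wasted)
```

In this toy setting, a longer `draft_len` lets more tokens be accepted per verification round when the draft model is right, but also discards more drafted tokens when it is wrong, which is the kind of trade-off a dynamic drafting policy has to manage.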
