SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Xiangchen Li; Dimitrios Spatharakis; Saeid Ghafouri; Jiakun Fan; Hans Vandierendonck; Deepu John; Bo Ji; Dimitrios Nikolopoulos

arXiv:2506.09397·cs.DC·November 6, 2025

Open Access

TL;DR

This paper presents SLED, a speculative decoding framework that enables efficient on-device inference of large language models by orchestrating local draft generation and server verification, significantly improving throughput and cost efficiency on edge devices.

Contribution

The paper introduces an edge-serving framework that uses speculative decoding to divide LLM inference between draft generation on resource-constrained edge devices and verification on a server.

Findings

01. Achieves 2.2x system throughput on edge devices.

02. Provides 2.8x system capacity improvement.

03. Maintains model accuracy while enhancing efficiency.

Abstract

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the…
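The draft-then-verify split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes greedy decoding with exact-match acceptance (a simplification of the sampling-based acceptance rule often used in speculative decoding), it replaces the real models with toy next-token functions, and it omits networking and server-side batching. All names (`draft_tokens`, `verify_tokens`, `speculative_generate`, `toy_draft`, `toy_target`) are hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def draft_tokens(draft_next: NextToken, context: List[Token], k: int) -> List[Token]:
    """Edge-device side (assumed): propose k candidate tokens with the small model."""
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))
    return drafted


def verify_tokens(target_next: NextToken, context: List[Token],
                  drafted: List[Token]) -> List[Token]:
    """Server side (assumed): keep the longest prefix the target model agrees with.

    On the first mismatch the target's own token is emitted instead, so every
    round yields at least one token. A real server would score all drafted
    positions in one forward pass; the loop here is only for clarity.
    """
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target model
            return accepted
        accepted.append(tok)
    # Every draft was accepted: the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def speculative_generate(draft_next: NextToken, target_next: NextToken,
                         prompt: List[Token], max_new_tokens: int,
                         k: int = 4) -> Tuple[List[Token], int]:
    """Alternate drafting and verification; return the sequence and round count."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < max_new_tokens:
        drafted = draft_tokens(draft_next, seq, k)
        seq.extend(verify_tokens(target_next, seq, drafted))
        rounds += 1
    return seq[:len(prompt) + max_new_tokens], rounds


# Toy models (assumptions for illustration only): the target counts modulo 10,
# and the draft agrees with it except right after token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_generate(toy_draft, toy_target, [0], 12, k=3)
    print(tokens, rounds)
```

Under greedy matching, the generated sequence is identical to what the target model would produce on its own, which is consistent with the accuracy-preservation claim above. The potential gain comes from the server handling several tokens per verification round rather than one token per call.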

Taxonomy

Topics: Algorithms and Data Compression