SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos

TL;DR
This paper presents SLED, a speculative decoding framework that enables efficient on-device inference of large language models by orchestrating local draft generation and server verification, significantly improving throughput and cost efficiency on edge devices.
Contribution
The paper introduces an edge-serving framework that uses speculative decoding to split work between local draft generation on the device and verification on a shared server, targeting LLM inference on resource-constrained devices.
Findings
Achieves 2.2x system throughput on edge devices.
Provides 2.8x system capacity improvement.
Maintains model accuracy while enhancing efficiency.
Abstract
The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the…
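The draft-then-verify split described in the abstract can be illustrated with a minimal sketch. This is not the SLED implementation: it assumes greedy decoding, and the names `draft_tokens`, `verify_tokens`, `speculative_generate`, `toy_target`, and `toy_draft` are hypothetical, with toy integer "models" standing in for real LLMs. A real server would typically verify all drafted tokens in one batched forward pass; the sequential loop here is only for readability.

```python
from typing import Callable, List

Token = int
# A "model" here is any function mapping a token context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def draft_tokens(draft_next: NextTokenFn, context: List[Token], k: int) -> List[Token]:
    """Edge device side: propose k tokens autoregressively with the small draft model."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))
    return proposed


def verify_tokens(target_next: NextTokenFn, context: List[Token],
                  proposed: List[Token]) -> List[Token]:
    """Server side: keep the longest prefix matching the target model's greedy choice.

    On the first mismatch, substitute the target's token and stop. If every
    drafted token matches, append one extra token from the target model.
    """
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))
    return accepted


def speculative_generate(draft_next: NextTokenFn, target_next: NextTokenFn,
                         prompt: List[Token], max_new_tokens: int, k: int = 4) -> List[Token]:
    """Alternate local drafting and server verification until enough tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_tokens(draft_next, out, k)
        out.extend(verify_tokens(target_next, out, proposed))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target_next: NextTokenFn, prompt: List[Token],
                    max_new_tokens: int) -> List[Token]:
    """Baseline: the target model decoding alone, one token at a time."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under greedy verification, the sequence produced by `speculative_generate` is the same as the one `greedy_generate` produces with the target model alone; the draft model only changes how many target-model steps are needed, which is how this kind of scheme can preserve accuracy.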
Taxonomy
Topics: Algorithms and Data Compression
