SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Jinwoo Park; Seunggeun Cho; Dongsu Han

arXiv:2505.17052 · cs.CL · November 19, 2025

TL;DR

SpecEdge is a scalable, cost-effective framework that leverages edge and server GPUs with speculative decoding to improve LLM serving efficiency and reduce latency.

Contribution

SpecEdge introduces an edge-assisted inference framework that splits LLM workloads between edge and server GPUs, overlaps edge token drafting with server verification, and interleaves multiple user requests to raise server-side throughput.
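To see why overlapping drafting with verification can help, here is a back-of-the-envelope timing model. It is not taken from the paper: the function names, the `hit_rate` parameter (the fraction of rounds in which the work drafted ahead turns out to be usable), and the example timings are all assumptions made for illustration.

```python
def sequential_round_time(draft_s: float, verify_s: float) -> float:
    """One round when the edge drafts, then waits for the server to verify."""
    return draft_s + verify_s


def overlapped_round_time(draft_s: float, verify_s: float, hit_rate: float) -> float:
    """Expected round time when the edge keeps drafting during verification.

    Assumed model (hypothetical, not from the paper):
      - on a hit, drafting is hidden behind verification, so the round
        costs max(draft_s, verify_s);
      - on a miss, the ahead-of-time draft is discarded and the round
        costs the full draft_s + verify_s.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    hit_cost = max(draft_s, verify_s)
    miss_cost = draft_s + verify_s
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


if __name__ == "__main__":
    # Illustrative numbers only: 30 ms to draft, 50 ms to verify (incl. network).
    for p in (0.0, 0.5, 1.0):
        print(p, overlapped_round_time(0.030, 0.050, p))
```

Under this model, overlap never makes a round slower than the sequential case. The gain grows with how often the drafted-ahead tokens survive verification.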

Findings

01. Achieves a 2.22x increase in server throughput.

02. Improves overall cost efficiency by 1.91x.

03. Reduces inter-token latency by 11.24% compared to a server-only baseline.

Abstract

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x by achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at…
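The draft-then-verify split described in the abstract can be illustrated with a minimal sketch of greedy speculative decoding. This is a generic toy version, not SpecEdge's implementation. The "models" are plain functions over token IDs, and every name here (`draft_tokens`, `verify_tokens`, `toy_target`, `toy_draft`) is hypothetical. A real server would score all draft tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# A "model" is any function mapping a token-ID prefix to the next token ID.
NextToken = Callable[[List[int]], int]


def draft_tokens(draft_model: NextToken, prefix: List[int], k: int) -> List[int]:
    """Edge side: propose k tokens autoregressively with the small model."""
    out: List[int] = []
    for _ in range(k):
        out.append(draft_model(prefix + out))
    return out


def verify_tokens(target_model: NextToken, prefix: List[int], draft: List[int]) -> List[int]:
    """Server side: keep the longest draft prefix the target model agrees
    with, then append one token produced by the target model itself."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_model(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction replaces the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target_model(prefix + accepted))  # extra token when all drafts pass
    return accepted


def generate(draft_model: NextToken, target_model: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Alternate drafting and verification; only token IDs cross the boundary."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_tokens(draft_model, tokens, k)
        tokens += verify_tokens(target_model, tokens, draft)
        rounds += 1
    return tokens[: len(prompt) + max_new_tokens], rounds


def greedy_reference(target_model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target model."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_model(tokens))
    return tokens


def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10  # toy rule: count upward modulo 10


def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    # Deliberately wrong at some positions so that rejection is exercised.
    return (guess + 1) % 10 if len(prefix) % 3 == 0 else guess


if __name__ == "__main__":
    out, rounds = generate(toy_draft, toy_target, [0], max_new_tokens=12)
    print(out, "verification rounds:", rounds)
```

With greedy verification the output matches what the target model would produce on its own. The benefit is that each server round can commit several tokens instead of one.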

Videos

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs (SlidesLive)